statwiki - User contributions [US]

CRITICAL ANALYSIS OF SELF-SUPERVISION

2020-11-30T11:20:39Z

A4moayye: /* References */

== Presented by ==
Maral Rasoolijaberi

== Introduction ==

This paper evaluated the performance of state-of-the-art self-supervised methods on learning weights of convolutional neural networks (CNNs) and on a per-layer basis. Also, this paper aims to figure out whether current self-supervision techniques can learn deep features from only one image.

The main goal of self-supervised learning is to take advantage of vast amount of unlabeled data for training CNNs and finding a generalized image representation.
In self-supervised learning, unlabeled data generate ground truth labels per se by pretext tasks such as the Jigsaw puzzle task[6], and the rotation estimation[3]. In the rotation task for example, we have a picture of a bird without the label "bird". We rotate the bird image by 90 degrees clockwise and the CNN is trained in a way that to find the rotation axis, as it can be seen in the figure below.

[[File:self-sup-rotation.png|700px|center]]

[[File:intro.png|500px|center]]

== Previous Work ==

In recent literature, several papers addressed self-supervised learning methods and learning from a single sample.

A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data and the encoder performs as the opposite of the generator. After training BiGAN, the encoder has learned to generate a rich image representation. In RotNet method [3], images are rotated and the CNN learns to figure out the direction. DeepCluster [4] alternates k-means clustering to learn stable feature representations under several image transformations.

== Method & Experiment ==

In this paper, BiGAN, RotNet and DeepCluster are employed for training AlexNet in a self-supervised manner.
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various methods of data augmentation including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image.

With the intention of measuring the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN, and are trained entirely independently of the CNN itself [5]. Note that the main purpose of CNNs is to reach a linearly discriminable representation for images. Accordingly, linear probing technique aims to evaluate the training of each layer of a CNN and inspect how much information each of the layers learned.
The same experiment has been done using the CIFAR10/100 dataset.

== Results ==

Figure 2 shows how well representations at each level are linearly separable.
According to results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is utilized in this experiment.

[[File:histo.png|500px|center]]

== Source Code ==

The source code for the paper can be found here: https://github.com/yukimasano/linear-probes

== Conclusion ==

This paper revealed that if a strong data-augmentation be employed, as little as a single image is sufficient for self-supervision techniques to learn the first few layers of popular CNNs. However, even the presence of millions of images are not enough for learning the deeper layers, and supervision might still be necessary. The results confirmed that the weights of the first layers of deep networks contain limited information about natural images. Accordingly, current unsupervised learning is only about augmentation, and we probably do not use the capacity of million images, yet.

== References ==

[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” inInternational Conference on Learning Representations, 2019

[2] J. Donahue, P. Kr ̈ahenb ̈uhl, and T. Darrell, “Adversarial feature learning,”arXiv preprint arXiv:1605.09782, 2016.

[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,”arXiv preprintarXiv:1803.07728, 2018

[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” inProceedings ofthe European Conference on Computer Vision (ECCV), 2018, pp. 132–149

[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,”arXiv preprint arXiv:1610.01644, 2016.

[6] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.

CRITICAL ANALYSIS OF SELF-SUPERVISION

2020-11-30T11:20:17Z

A4moayye: /* Introduction */

CRITICAL ANALYSIS OF SELF-SUPERVISION

2020-11-30T11:16:23Z

A4moayye: /* Introduction */

== Presented by ==
Maral Rasoolijaberi

== Introduction ==

This paper evaluated the performance of state-of-the-art self-supervised methods on learning weights of convolutional neural networks (CNNs) and on a per-layer basis. Also, this paper aims to figure out whether current self-supervision techniques can learn deep features from only one image.

The main goal of self-supervised learning is to take advantage of vast amount of unlabeled data for training CNNs and finding a generalized image representation.
In self-supervised learning, unlabeled data generate ground truth labels per se by pretext tasks such as rotation estimation. For example, we have a picture of a bird without the label "bird". We rotate the bird image by 90 degrees clockwise and the CNN is trained in a way that to find the rotation axis[3], as it can be seen in the figure below.

[[File:self-sup-rotation.png|700px|center]]

[[File:intro.png|500px|center]]

== Previous Work ==

In recent literature, several papers addressed self-supervised learning methods and learning from a single sample.

A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data and the encoder performs as the opposite of the generator. After training BiGAN, the encoder has learned to generate a rich image representation. In RotNet method [3], images are rotated and the CNN learns to figure out the direction. DeepCluster [4] alternates k-means clustering to learn stable feature representations under several image transformations.

== Method & Experiment ==

In this paper, BiGAN, RotNet and DeepCluster are employed for training AlexNet in a self-supervised manner.
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various methods of data augmentation including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image.

With the intention of measuring the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN, and are trained entirely independently of the CNN itself [5]. Note that the main purpose of CNNs is to reach a linearly discriminable representation for images. Accordingly, linear probing technique aims to evaluate the training of each layer of a CNN and inspect how much information each of the layers learned.
The same experiment has been done using the CIFAR10/100 dataset.

== Results ==

Figure 2 shows how well representations at each level are linearly separable.
According to results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is utilized in this experiment.

[[File:histo.png|500px|center]]

== Source Code ==

The source code for the paper can be found here: https://github.com/yukimasano/linear-probes

== Conclusion ==

This paper revealed that if a strong data-augmentation be employed, as little as a single image is sufficient for self-supervision techniques to learn the first few layers of popular CNNs. However, even the presence of millions of images are not enough for learning the deeper layers, and supervision might still be necessary. The results confirmed that the weights of the first layers of deep networks contain limited information about natural images. Accordingly, current unsupervised learning is only about augmentation, and we probably do not use the capacity of million images, yet.

== References ==

[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” inInternational Conference on Learning Representations, 2019

[2] J. Donahue, P. Kr ̈ahenb ̈uhl, and T. Darrell, “Adversarial feature learning,”arXiv preprint arXiv:1605.09782, 2016.

[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,”arXiv preprintarXiv:1803.07728, 2018

[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” inProceedings ofthe European Conference on Computer Vision (ECCV), 2018, pp. 132–149

[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,”arXiv preprint arXiv:1610.01644, 2016.

CRITICAL ANALYSIS OF SELF-SUPERVISION

2020-11-30T11:15:53Z

A4moayye: /* Introduction */

== Presented by ==
Maral Rasoolijaberi

== Introduction ==

This paper evaluated the performance of state-of-the-art self-supervised methods on learning weights of convolutional neural networks (CNNs) and on a per-layer basis. Also, this paper aims to figure out whether current self-supervision techniques can learn deep features from only one image.

The main goal of self-supervised learning is to take advantage of vast amount of unlabeled data for training CNNs and finding a generalized image representation.
In self-supervised learning, unlabeled data generate ground truth labels per se by pretext tasks such as rotation estimation. For example, we have a picture of a bird without the label "bird". We rotate the bird image by 90 degrees clockwise and the CNN is trained in a way that to find the rotation axis, as it can be seen in the figure below[3].

[[File:self-sup-rotation.png|700px|center]]

[[File:intro.png|500px|center]]

== Previous Work ==

In recent literature, several papers addressed self-supervised learning methods and learning from a single sample.

A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data and the encoder performs as the opposite of the generator. After training BiGAN, the encoder has learned to generate a rich image representation. In RotNet method [3], images are rotated and the CNN learns to figure out the direction. DeepCluster [4] alternates k-means clustering to learn stable feature representations under several image transformations.

== Method & Experiment ==

In this paper, BiGAN, RotNet and DeepCluster are employed for training AlexNet in a self-supervised manner.
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various methods of data augmentation including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image.

With the intention of measuring the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN, and are trained entirely independently of the CNN itself [5]. Note that the main purpose of CNNs is to reach a linearly discriminable representation for images. Accordingly, linear probing technique aims to evaluate the training of each layer of a CNN and inspect how much information each of the layers learned.
The same experiment has been done using the CIFAR10/100 dataset.

== Results ==

Figure 2 shows how well representations at each level are linearly separable.
According to results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is utilized in this experiment.

[[File:histo.png|500px|center]]

== Source Code ==

The source code for the paper can be found here: https://github.com/yukimasano/linear-probes

== Conclusion ==

This paper revealed that if a strong data-augmentation be employed, as little as a single image is sufficient for self-supervision techniques to learn the first few layers of popular CNNs. However, even the presence of millions of images are not enough for learning the deeper layers, and supervision might still be necessary. The results confirmed that the weights of the first layers of deep networks contain limited information about natural images. Accordingly, current unsupervised learning is only about augmentation, and we probably do not use the capacity of million images, yet.

== References ==

[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” inInternational Conference on Learning Representations, 2019

[2] J. Donahue, P. Kr ̈ahenb ̈uhl, and T. Darrell, “Adversarial feature learning,”arXiv preprint arXiv:1605.09782, 2016.

[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,”arXiv preprintarXiv:1803.07728, 2018

[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” inProceedings ofthe European Conference on Computer Vision (ECCV), 2018, pp. 132–149

[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,”arXiv preprint arXiv:1610.01644, 2016.

File:self-sup-rotation.png

2020-11-30T11:14:10Z

A4moayye:

Self-Supervised Learning of Pretext-Invariant Representations

2020-11-30T10:39:33Z

A4moayye: /* Introduction */

==Authors==

Ishan Misra, Laurens van der Maaten

== Presented by ==
Sina Farsangi

== Introduction ==

Modern image recognition and object detection systems find image representations using a large number of data with pre-defined semantic annotations. Some examples of these annotations are class labels [1] and bonding boxes [2] as shown in Figure 1. For finding representations using pre-defined semantic annotations, there is a need for large number of labeled data which is not the case in all scenarios. Also, these systems usually learn features that are specific for a particular type of classes and not necessarily semantically meaningful features that can help to generalize to other domains and classes. '''In other words, pre-defined semantic annotations scale poorly to the long tail of visual concepts'''[3]. Therefore, there has been a big interest in the community to find image representations that are more visually meaningful and can help in several tasks such as image recognition and object detection. One of the fast growing areas of research that tries to address this problem is '''Self-Supervised Learning'''. Self-Supervised Learning tries to learn deep models that find image representations from the pixels themselves rather than using pre-defined semantic annotated data. As we will show, there is no need for using class labels or bounding boxes in self-supervised learning.

[[File: SSL_1.JPG | 800px | center]]
<div align="center">'''Figure 1:''' Semantic Annotations used for finding image representations: a) Class labels and b) Bounding Boxes </div>

Self-Supervised Learning is often done using a set of tasks called '''Pretext tasks'''. During these tasks, a transformation <math> \tau </math> is applied to unlabeled images <math> I </math> to obtain a set of transformed images, <math> I^{t} </math>. Then, a deep neural network, <math> \phi(\theta) </math>, is trained to predict the transformation characteristic. Several Pretext Tasks exist based on the type of used transformation. Two of the most used pretext tasks are rotations and jigsaw puzzle [4,5,6]. As shown in Figure 2, in the rotation task, unlabeled images, <math> </math> are rotated by random degrees (0,90,180,270) and the deep network learns to predict the rotation degree. Also, in jigsaw task which is more complicated than rotation task, unlabeled images are cropped into 9 patches and then, the image is perturbed by randomly permuting the nine patches. Each permutation falls into one of the 35 classes according to a formula. Then, a deep network is trained to predict the class of the permutation of the patches in the perturbed image. Some other tasks include colorization, where the model tries to revert the colors of a colored image turned to greyscale, and image reconstruction, where a square chunk of the image is deleted and the model tries to reconstruct that part.

[[File: SSL_2.JPG |1000px | center]]
<div align="center">'''Figure 2:''' Self-Supervised Learning using Rotation and Jigsaw Pretext Tasks </div>

Although the proposed Pretext Tasks have obtained promising results, they have the disadvantage of being covariant to the applied transformation. In other words, as deep networks are trained to predict transformations characteristics, they will also learn representations that will vary based on the applied transformation. By intuition, we would like to obtain representations the are common between the original images and the transformed ones. This idea is supported by the fact that humans are able to recognize these transformed images. This hints us to try to develop a method that obtains image representations that are common between the original and transformed images, in other words, image representations that are transformation invariant. The summarized paper tries to address this problem by introducing '''Pretext Invariant Representation Learning''' (PIRL) that learns to obtain Self-Supervised image representations that as opposed to Pretext Tasks are transformation invariant and therefore, more semantically meaningful. The performance of the proposed method is evaluated on several Self-Supervision learning benchmarks. The results show that the PIRL introduces a new state-of-the-art method in Self-Supervised Learning by learning transformation invariant representations.

== Problem Formulation and Methodology ==

[[File: SSL_3.JPG | 800px | center]]
<div align="center">'''Figure 3:''' Figure 3: Overview of Standard Pretext Learning and Pretext-Invariant Representation Learning (PIRL). </div>

An overview of the proposed method and a comparison with Pretext Tasks are shown in Figure 3. For a given image ,<math>I</math>, in the Dataset of unlabeled images, <math> D=\{{I_1,I_2,...,I_{|D|}}\} </math>, a transformation <math> \tau </math> is applied:

\begin{align} \tag{1} \label{eqn:1}
I^t=\tau(I)
\end{align}

Where <math>I^t</math> is the transformed image. We would like to train a convolutional neural network, <math>\phi(\theta)</math>, that constructs image representations <math>v_{I}=\phi_{\theta}(I)</math>. Pretext Task based methods learn to predict transformation characteristics, <math>z(t)</math>, by minimizing a transformation covariant loss function in the form of:

\begin{align} \tag{2} \label{eqn:2}
l_{\text{cov}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,z(t)
\end{align}

As it can be seen, the loss function covaries with the applied transformation and therefore, the obtained representations may not be semantically meaningful. PIRL tries to solve for this problem as shown in Figure 3. The original and transformed images are passed through two parallel convolutional neural networks to obtain two set of representations, <math>v(I)</math> and <math>v(I^t)</math>. Then, a contrastive loss function is defined to ensure that the representations of the original and transformed images are similar to each other. The transformation invariant loss function can be defined as:

\begin{align} \tag{3} \label{eqn:3}
l_{\text{inv}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,v_{I^t})
\end{align}

Where L is a contrastive loss based on Noise Contrastive Estimators (NCE). The NCE function can be shown as below:

\begin{align} \tag{4} \label{eqn:4}
h(v_I,v_{I^t})=\frac{\exp \biggl( \frac{s(v_I,v_{I^t}}{\tau}) \biggr)}{\exp \biggl(\frac{s(v_I,v_{I^t}}{\tau} \biggr) + \sum_{I^{'} \in D_N}^{} \exp \biggl( \frac{s(v_{I^t},v_{I^{'}}}{\tau}) \biggr)}
\end{align}

where <math>s(.,.)</math> is the cosine similarity function and <math>\tau</math> is the temperature parameter that is usually set to 0.07. Also, a set of N images are chosen randomly from dataset where <math>I^{'}\neq I</math>. These images are used in the loss in order to ensure their representation dissimilarity with transformed image representations. Also, during model implementation, two heads (few additional deep layers) , <math>f</math> and <math>g</math>, are applied on top of <math>v(I)</math> and <math>v(I^t)</math>. Using the NCE formulation, the contrastive loss can be written as:

\begin{align} \tag{5} \label{eqn:5}
L_{\text{NCE}}(I,I^{t})=-\text{log}[h(f(v_I),g(v_{I^t}))]-\sum_{I^{'}\in D_N}^{} \text{log}[1-h(g(v_{I^t}),f(v_{I^{'}}))]
\end{align}

[[File: SSL_4.JPG | 800px | center]]
<div align="center">'''Figure 4:''' Proposed PIRL </div>

Although the formulation looks complicated, the take out here is that by minimizing the NCE based loss function, the similarity between the original and transformed image representations, <math>v(I)</math> and <math>v(I^t)</math> , increases and at the same time the dissimilarity between <math>v(I^t)</math> and negative images representations, <math>v(I^{'})</math>, are increased. During training a memory bank [], <math>m_I</math>, of dataset image representations are used to access the representations of the dataset images including the negative images. The proposed PIRL model is shown in Figure (4). Finally, the contrastive loss in equation (5) does not take into account the dissimilarity between the original image representations, <math>v(I)</math>, and the negative image representations, <math>v(I^{'})</math>. By taking this into account and using the memory bank, the final contrastive loss function is obtained as:

\begin{align} \tag{6} \label{eqn:6}
L(I,I^{t})=\lambda L_{\text{NCE}}(m_I,g(v_{I^t})) + (1-\lambda)L_{\text{NCE}}(m_I,f(v_{I}))
\end{align}
Where <math>\lambda</math> is a hyperparameter that determines the weight of each of NCE losses. The default value for this parameter is 0.5. In the next section, experimental results are shown using the proposed PIRL model.

==Experimental Results ==

For the experiments in this section, PIRL is implemented using jigsaw transformations. The combination of PIRL with other types of transformations is shown in last section of the summary. The quality of image representations obtained from PIRL Self-Supervised Learning is evaluated by comparing its performance to other Self-Supervised Learning methods on image recognition and object detection tasks. For the experiments, a ResNet50 model is trained using PIRL and other methods by using 1.28M randomly sampled images from ImageNet dataset. Also, the number of negative images used for PIRL is N=32000.

===Object Detection===

For object detection, a Faster R-CNN[] model is used with a ResNet-50 backbone which is pre-trained using PIRL and other Self-Supervised methods. Then, the pre-trained model weights are used as initial weights for the Faster-RCNN model backbone during training on VOC07+12 dataset. The result of object detection using PIRL is shown in Figure (5) and it is compared to other methods. It can be seen that PIRL not only outperforms other Self-Supervised based methods, '''for the first time it outperforms Supervised Pretraining on object detection'''.

[[File: SSL_5.PNG | 800px | center]]
<div align="center">'''Figure 5:''' Object detection on VOC07+12 using Faster R-CNN and comparing the Average Precision (AP) of detected bounding boxes. (The values for the blank spaces are not mentioned in the corresponding paper.) </div>

===Image Classification with linear models===

In the next experiment, the performance of the PIRL is evaluated on image classification using four different datasets. For this experiment, the ResNet-50 pretrained model is fixed and used as an image feature extractor. Then, a linear classifier is trained on fixed image representations. The results are shown in Figure (6). The results show that while PIRL substantially outperforms other Self-Supervised Learning methods, it still falls behinds Supervised Pretrained Learning.

[[File: SSL_6.PNG | 800px | center]]
<div align="center">'''Figure 6:''' Image classification with linear models. (The values for the blank spaces are not mentioned in the corresponding paper.) </div>

Overall, the results show that PIRL performs best among different Self-Supervised Learning methods. Even, it is able to perform better than Supervised Learning Pretrained model on object detection. This is because PIRL learns representations that are invariant to the applied transformations which results in more semantically meaningful and richer visual features. In the next section, some analysis on PIRL is presented.

==Analysis==

===Does PIRL learn invariant representations?===

In order to show that the image representations obtained using PIRL are invariant, several images are chosen from ImageNet dataset and representations of the chosen images and their transformed version are obtained using one time PIRL and another time the jigsaw pretext task which is the transformation covariant version of PIRL. Then, for each method, the L2 norm between the original and transformed image representations are computed and their distributions are plotted in Figure (7). It can be seen that PIRL results in more similarity between the original and transformed images representations. Therefore, PIRL learns invariant representations.

[[File: SSL_7.PNG | 800px | center]]
<div align="center">'''Figure 7:''' Invariance of PIRL representations. </div>

===What is the effect of <math>\lambda</math> in the PIRL loss function?===

In order to investigate the effect of <math>\lambda</math> on PIRL representations, the authors obtained the accuracy of image recognition on ImageNet dataset using different values for <math>\lambda</math> in PIRL. As shown in Figure 8, the results show that the value of <math>\lambda</math> affects the performance of PIRL and the optimum value for <math>\lambda</math> is 0.5.

[[File: SSL_8.PNG | 800px | center]]
<div align="center">'''Figure 8:''' Effect of varying the parameter <math>\lambda</math> </div>

===What is the effect of the number of image transforms?===

As another experiment, the authors investigated the number of image transforms and its effect on PIRL performance. There is a limitation on the number of transformations that can be applied using the jigsaw pretext method as this method has to predict the permutation of the patches and the number of the parameters in the classification layer grows linearly with the number of used transformations. However, PIRL is able to use all number of image transformations which is equal to <math>9! \approx 3.6\times 10^5</math>. Figure (9) shows the effect of changing the number of patch permutations on PIRL and jigsaw. The results show that increasing the number of permutations increases the mean Average Precision (mAP) of PIRL on image classification using VOCC07 dataset.

[[File: SSL_9.PNG | 800px | center]]
<div align="center">'''Figure 9:''' Effect of varying the number of patch permutations </div>

===What is the effect of the number of negative samples?===

In order to investigate the effect of negative samples number, N, on PIRL's performance, the image classification accuracy is obtained using ImageNet dataset for a variety of values for N. As it is shown in Figure (10), increasing the number of negative sample results in richer image representations and higher classification accuracy.

[[File: SSL_10.PNG | 800px | center]]
<div align="center">'''Figure 10:''' Effect of varying the number of negative samples </div>

==Generalizing PIRL to Other Pretext Tasks==

The used PIRL model in this paper used jigsaw permutations as the applied transformation to the original image. However, PIRL is generalizable to other Pretext Tasks. To show this, first, PIRL is used with rotation transformations and the performance of rotation based PIRL is compared to the covariant rotation Pretext Task. The results in Figure (11) show that using PIRL substantially increases the classification accuracy on four datasets in comparison with the rotation Pretext Task. Next, both jigsaw and rotation transformations are used with PIRL to obtain image representations. The results show that combining multiple transformations with PIRL can further improve the accuracy on image classification task.

[[File: SSL_11.PNG | 800px | center]]
<div align="center">'''Figure 11:''' Using PIRL with (combinations of) different pretext tasks </div>

==Conclusion==

In this paper, a new state-of-the-art Self-Supervised learning method, PIRL, was presented. The proposed model learns to obtain features that are common between the original and transformed images, resulting in a set of transformation invariant and more semantically meaningful features. This is done by defining a contrastive loss function between the original images, transformed images and a set of negative images. The results show that PIRL image representation is richer than previously proposed methods, resulting in higher accuracy and precision on image classification and object detection tasks.

==Critiques==

The paper proposes a very nice method on obtaining transformation invariant image representations. However, the authors can extend their work with richer set of transformations. Also, it would be a good idea to investigate the combination of PIRL with clustering based methods [7,8]. That may result in better image representations.

== Source Code ==

https://paperswithcode.com/paper/self-supervised-learning-of-pretext-invariant

== References ==

[1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.

[2] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2015.

[3] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint, 2017

[4] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.

[5] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.

[6] Jong-Chyi Su, Subhransu Maji, Bharath Hariharan. When does self-supervision improve few-shot learning? European Conference on Computer Vision, 2020.

[7] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.

[8] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.

Extreme Multi-label Text Classification

2020-11-24T12:21:32Z

A4moayye: /* APLC-XLNet */

== Presented By ==
Mohan Wu

== Introduction ==
In this paper, the authors are interested a field of problems called extreme classification. These problems involve training a classifier to give the most relevant tags for any given text; the difficulties arises from the fact that the label set is so large that most models give poor results. The authors propose a new model called APLC-XLNet which fine tunes the generalized autoregressive pretrained model (XLNet) by using Adaptive Probabilistic Label Clusters (APLC) to calculate cross entropy loss. This method takes advantage of unbalanced label distributions by forming clusters to reduce training time. The authors experimented on five different datasets and achieved results far better than existing state-of-the-art models.

== Motivation ==
Extreme multi-label text classification (XMTC) has applications in many recent problems such as providing word representations of a large vocabulary [1], tagging Wikipedia with relevant labels [2] and giving product descriptions for search advertisements [3]. The authors are motivated by the shortcomings of traditional methods in the creation of XMTC. For example, one such method of classifying text is the bag-of-words (BOW) approach where a vector represents the frequency of a word in a corpus. However, BOW does not consider the location of the words so it cannot determine context and semantics. Motivated by the success of transfer learning in a wide range of natural language processing (NLP) problems, the authors propose to adapt XLNet [4] on the XMTC problem. The final challenge is the nature of the labelling distribution can be very sparse for some labels. The authors solve this problem by combining the Probabilistic Label Tree [5] method and the Adaptive Softmax [6] to create APLC.

== Related Work ==
Two approaches have been proposed to solve the XMTC problem: traditional BOW techniques, and modern deep learning models.

=== BOW Approaches ===
Intuitively, researchers can apply the one-vs-all approach in which they can fit a classifier for each label and thus XMTC reduces to a binary classification problem. However, due to the large number of labels, this approach is very computationally expensive. There have been some techniques to reduce the complexity by pruning weights to induce sparsity but still this method is quite expensive. Another approach is to simply apply a dimensional reduction technique on the label space. However doing so have shown to have serious negative effects on the prediction accuracy. Finally, another approach is to use a tree to partition labels into groups based on similarity. This approach have shown to be quite fast but unfortunately, due to the problems with BOW methods, the accuracy is poor.

=== Deep Learning Approaches ===
Unlike BOW approaches, deep learning can learn dense representations of the corpus using context and semantics. One such example is X-BERT[7] which divides XTMC problems into 3 steps. First, it partitions the label set into clusters based on similarity. Next, it fits a BERT model on the label clusters for the given corpus. Finally, it trains simple linear classifiers to rank the labels in each cluster. Since there are elements of traditional techniques in X-BERT, namely, the clustering step and the linear classifier step, the authors propose to improve upon this approach.

== APLC-XLNet ==
APLC-XLNet consists of three parts: the pretrained XLNet-Base as the base, the APLC output layer and a fully connected hidden layer connecting the pool layer of XLNet the output layer, as it can be seen in figure 1. One major challenge in XTMC problems is that most data fall into a small group of labels. To tackle this challenge, the authors propose partitioning the label set into one head cluster, <math> V_h </math>, and many tail clusters, <math> V_1 \cup \ldots \cup V_K </math>. The head cluster contains the most popular labels while the tail clusters contains the rest of the labels. The clusters are then inserted in a 2-level tree where the root node is the head cluster and the leaves are the tail clusters. Using this architecture improves computation times significantly since most of the time the data stops at the root node.

[[File:Capture1111.JPG |center|600px]]

<div align="center">Figure 1: Architecture of the proposed APLC-XLNet model. V denotes the label cluster in APLC.</div>

The authors define the probability of each label as follows:

\begin{equation}
p(y_{ij} | x) =
\begin{cases}
p(y_{ij}|x) & \text{if } y_{ij} \in V_h \\
p(V_t|x)p(y_{ij}|V_t,x) & \text{if } y_{ij} \in V_t
\end{cases}
\end{equation}

where <math> x </math> is the feature of a given sample, <math> y_{ij} </math> is the j-th label in the i-th sample, and <math> V_t </math> is the t-th tail cluster. Let <math> Y_i </math> be the set of labels for the i-th sample, and define <math> L_i = |Y_i| </math>. The authors propose an intuitive objective loss function for multi-label classification:

\begin{equation}
J(\theta) = -\frac{1}{\sum_{i=1}^N L_i} \sum_{i=1}^N \sum_{j \in Y_i} (y_{ij} logp(y_{ij}) + (1 - y_{ij}) log(1- p(y_{ij}))
\end{equation}
where <math>N</math> is the number of samples, <math>p(y_{ij})</math> is defined above and <math> y_{ij} \in \{0, 1\} </math>.

The number of parameters in this model is given by:
\begin{equation}
N_{par} = d(l_h + K) + \sum_{i=1}^K \frac{d}{q^i}(d+l_i)
\end{equation}
where <math> d </math> is the dimension of the hidden state of <math> V_h </math>, <math> q </math> is a decay variable, <math> l_h = |V_h| </math> and <math> l_i = |V_i| </math>. Furthermore, the computational cost can be expressed as follows:
\begin{align}
C &= C_h + \sum_{i=1}^K C_i \\
&= O(N_b d(l_h + K)) + O(\sum_{i=1}^K p_i N_b \frac{d}{q^i}(l_i + d))
\end{align}
where <math> N_b </math> is the batch size.

=== Training APLC-XLNet ===
Training APLC-XLNet essentially boils down to training its three parts. The authors suggest using discriminative fine-tuning method[8] to train the model entirely while assigning different learning rates to each part. Since XLNet is pretrained, the learning rate, <math> \eta_x </math>, should be small, while the output layer is specific to this type of problem so its learning rate, <math> \eta_a </math>, should be large. For the connecting hidden layer, the authors chose a learning rate, <math> \eta_h </math>, such that <math> \eta_x < \eta_h < \eta_a </math>. For each of the learning rates, the authors suggest a slanted triangular learning schedule[8] defined as:
\begin{equation}
\eta =
\begin{cases}
\eta_0 \frac{t}{t_w} & \text{if } t \leq t_w \\
\eta_0 \frac{t_a - t}{t_a - t_w} & \text{if } t > t_w
\end{cases}
\end{equation}
where <math> \eta_0 </math> is the starting learning rate, <math> t </math> is the current step, <math> t_w </math> is the chosen warm-up threshold and <math> t_a </math> is the total number of steps. The objective here is to motivate the model to converge quickly to the suitable space at the beginning and then refine the parameters. Learning rates are first increased linearly, and then decayed gradually according to the strategy.

== Results ==
The authors tested the APLC-XLNet model in several benchmark datasets against current state-of-the-art models. The evaluation metric, P@k is defined as:
\begin{equation}
P@k = \frac{1}{k} \sum_{i \in rank_k(\hat{y})} y_i
\end{equation}
where <math> rank_k(\hat{y}) </math> is the top k ranked probability in the prediction vector, <math> \hat{y} </math>.

[[File:Paper.PNG|1000px|]]

To help with tuning the model in the number of clusters and different partitions, the authors experimented on two different datasets: EURLex and Wiki10.

[[File:XTMC2.PNG|1000px|]]

The three different partitions in the second graph the authors used were (0.7, 0.2, 0.1), (0.33, 0.33, 0.34), and (0.1, 0.2, 0.7) where 3 clusters were fixed.

== Conclusion ==
The authors have proposed a new deep learning approach to solve the XMTC problem based on XLNet, namely, APLC-XLNet. APLC-XLNet consists of three parts: the pretrained XLNet that takes the input of the text, a connecting hidden layer, and finally an APLC output layer to give the rankings of relevant labels. Their experiments show that APLC-XLNet has better results in several benchmark datasets over the current state-of-the-art models.

== Critques ==
The authors chose to use the same architecture for every dataset. The model does not achieve state-of-the-art performance on the larger datasets. Perhaps, a more complex model in the second part of the model could help achieve better results. The authors also put a lot of effort in explaining the model complexity for APLC-XLNet but does not compare it other state-of-the-art models. A table of model parameters and complexity for each model could be helpful in explaining why their techniques are efficient.

== References ==
[1] Mikolov, T., Kombrink, S., Burget, L., Cernock ˇ y, J., and `
Khudanpur, S. Extensions of recurrent neural network
language model. In 2011 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP),
pp. 5528–5531. IEEE, 2011.

[2] Dekel, O. and Shamir, O. Multiclass-multilabel classification with more classes than examples. In Proceedings
of the Thirteenth International Conference on Artificial
Intelligence and Statistics, pp. 137–144, 2010.

[3] Jain, H., Prabhu, Y., and Varma, M. Extreme multi-label loss
functions for recommendation, tagging, ranking & other
missing label applications. In Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 935–944. ACM, 2016.

[4] Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li,
M., and Lin, J. End-to-end open-domain question answering with BERTserini. In NAACL-HLT (Demonstrations),
2019a.

[5] Jasinska, K., Dembczynski, K., Busa-Fekete, R.,
Pfannschmidt, K., Klerx, T., and Hullermeier, E. Extreme f-measure maximization using sparse probability
estimates. In International Conference on Machine Learning, pp. 1435–1444, 2016.

[6] Grave, E., Joulin, A., Cisse, M., J ´ egou, H., et al. Effi- ´
cient softmax approximation for gpus. In Proceedings of
the 34th International Conference on Machine LearningVolume 70, pp. 1302–1310. JMLR.org, 2017

[7] Wei-Cheng, C., Hsiang-Fu, Y., Kai, Z., Yiming, Y., and
Inderjit, D. X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder Representations from
Transformers. In NeurIPS Science Meets Engineering of
Deep Learning Workshop, 2019.

[8] Howard, J. and Ruder, S. Universal language model finetuning for text classification. In Proceedings of the 56th
Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pp. 328–339, 2018

Extreme Multi-label Text Classification

2020-11-24T11:58:24Z

A4moayye:

== Presented By ==
Mohan Wu

== Introduction ==
In this paper, the authors are interested a field of problems called extreme classification. These problems involve training a classifier to give the most relevant tags for any given text; the difficulties arises from the fact that the label set is so large that most models give poor results. The authors propose a new model called APLC-XLNet which fine tunes the generalized autoregressive pretrained model (XLNet) by using Adaptive Probabilistic Label Clusters (APLC) to calculate cross entropy loss. This method takes advantage of unbalanced label distributions by forming clusters to reduce training time. The authors experimented on five different datasets and achieved results far better than existing state-of-the-art models.

== Motivation ==
Extreme multi-label text classification (XMTC) has applications in many recent problems such as providing word representations of a large vocabulary [1], tagging Wikipedia with relevant labels [2] and giving product descriptions for search advertisements [3]. The authors are motivated by the shortcomings of traditional methods in the creation of XMTC. For example, one such method of classifying text is the bag-of-words (BOW) approach where a vector represents the frequency of a word in a corpus. However, BOW does not consider the location of the words so it cannot determine context and semantics. Motivated by the success of transfer learning in a wide range of natural language processing (NLP) problems, the authors propose to adapt XLNet [4] on the XMTC problem. The final challenge is the nature of the labelling distribution can be very sparse for some labels. The authors solve this problem by combining the Probabilistic Label Tree [5] method and the Adaptive Softmax [6] to create APLC.

== Related Work ==
Two approaches have been proposed to solve the XMTC problem: traditional BOW techniques, and modern deep learning models.

=== BOW Approaches ===
Intuitively, researchers can apply the one-vs-all approach in which they can fit a classifier for each label and thus XMTC reduces to a binary classification problem. However, due to the large number of labels, this approach is very computationally expensive. There have been some techniques to reduce the complexity by pruning weights to induce sparsity but still this method is quite expensive. Another approach is to simply apply a dimensional reduction technique on the label space. However doing so have shown to have serious negative effects on the prediction accuracy. Finally, another approach is to use a tree to partition labels into groups based on similarity. This approach have shown to be quite fast but unfortunately, due to the problems with BOW methods, the accuracy is poor.

=== Deep Learning Approaches ===
Unlike BOW approaches, deep learning can learn dense representations of the corpus using context and semantics. One such example is X-BERT[7] which divides XTMC problems into 3 steps. First, it partitions the label set into clusters based on similarity. Next, it fits a BERT model on the label clusters for the given corpus. Finally, it trains simple linear classifiers to rank the labels in each cluster. Since there are elements of traditional techniques in X-BERT, namely, the clustering step and the linear classifier step, the authors propose to improve upon this approach.

== APLC-XLNet ==
APLC-XLNet consists of three parts: the pretrained XLNet-Base as the base, the APLC output layer and a fully connected hidden layer connecting the pool layer of XLNet the output layer, as it can be seen in figure 1. One major challenge in XTMC problems is that most data fall into a small group of labels. To tackle this challenge, the authors propose partitioning the label set into one head cluster, <math> V_h </math>, and many tail clusters, <math> V_1 \cup \ldots \cup V_K </math>. The head cluster contains the most popular labels while the tail clusters contains the rest of the labels. The clusters are then inserted in a 2-level tree where the root node is the head cluster and the leaves are the tail clusters. Using this architecture improves computation times significantly since most of the time the data stops at the root node.

[[File:Capture1111.JPG |center|600px]]

<div align="center">Figure 1: Architecture of the proposed APLC-XLNet model. V denotes the label cluster in APLC.</div>

The authors define the probability of each label as follows:

\begin{equation}
p(y_{ij} | x) =
\begin{cases}
p(y_{ij}|x) & \text{if } y_{ij} \in V_h \\
p(V_t|x)p(y_{ij}|V_t,x) & \text{if } y_{ij} \in V_t
\end{cases}
\end{equation}

where <math> x </math> is the feature of a given sample, <math> y_{ij} </math> is the j-th label in the i-th sample, and <math> V_t </math> is the t-th tail cluster. Let <math> Y_i </math> be the set of labels for the i-th sample, and define <math> L_i = |Y_i| </math>. The authors propose an intuitive objective loss function for multi-label classification:

\begin{equation}
J(\theta) = -\frac{1}{\sum_{i=1}^N L_i} \sum_{i=1}^N \sum_{j \in Y_i} (y_{ij} logp(y_{ij}) + (1 - y_{ij}) log(1- p(y_{ij}))
\end{equation}
where <math>N</math> is the number of samples, <math>p(y_{ij})</math> is defined above and <math> y_{ij} \in \{0, 1\} </math>.

The number of parameters in this model is given by:
\begin{equation}
N_{par} = d(l_h + K) + \sum_{i=1}^K \frac{d}{q^i}(d+l_i)
\end{equation}
where <math> d </math> is the dimension of the hidden state of <math> V_h </math>, <math> q </math> is a decay variable, <math> l_h = |V_h| </math> and <math> l_i = |V_i| </math>. Furthermore, the computational cost can be expressed as follows:
\begin{align}
C &= C_h + \sum_{i=1}^K C_i \\
&= O(N_b d(l_h + K)) + O(\sum_{i=1}^K p_i N_b \frac{d}{q^i}(l_i + d))
\end{align}
where <math> N_b </math> is the batch size.

=== Training APLC-XLNet ===
Training APLC-XLNet essentially boils down to training its three parts. The authors suggest using discriminative fine-tuning method[8] to train the model entirely while assigning different learning rates to each part. Since XLNet is pretrained, the learning rate, <math> \eta_x </math>, should be small, while the output layer is specific to this type of problem so its learning rate, <math> \eta_a </math>, should be large. For the connecting hidden layer, the authors chose a learning rate, <math> \eta_h </math>, such that <math> \eta_x < \eta_h < \eta_a </math>. For each of the learning rates, the authors suggest a slanted triangular learning schedule defined as:
\begin{equation}
\eta =
\begin{cases}
\eta_0 \frac{t}{t_w} & \text{if } t \leq t_w \\
\eta_0 \frac{t_a - t}{t_a - t_w} & \text{if } t > t_w
\end{cases}
\end{equation}
where <math> \eta_0 </math> is the starting learning rate, <math> t </math> is the current step, <math> t_w </math> is the chosen warm-up threshold and <math> t_a </math> is the total number of steps.

== Results ==
The authors tested the APLC-XLNet model in several benchmark datasets against current state-of-the-art models. The evaluation metric, P@k is defined as:
\begin{equation}
P@k = \frac{1}{k} \sum_{i \in rank_k(\hat{y})} y_i
\end{equation}
where <math> rank_k(\hat{y}) </math> is the top k ranked probability in the prediction vector, <math> \hat{y} </math>.

[[File:Paper.PNG|1000px|]]

To help with tuning the model in the number of clusters and different partitions, the authors experimented on two different datasets: EURLex and Wiki10.

[[File:XTMC2.PNG|1000px|]]

The three different partitions in the second graph the authors used were (0.7, 0.2, 0.1), (0.33, 0.33, 0.34), and (0.1, 0.2, 0.7) where 3 clusters were fixed.

== Conclusion ==
The authors have proposed a new deep learning approach to solve the XMTC problem based on XLNet, namely, APLC-XLNet. APLC-XLNet consists of three parts: the pretrained XLNet that takes the input of the text, a connecting hidden layer, and finally an APLC output layer to give the rankings of relevant labels. Their experiments show that APLC-XLNet has better results in several benchmark datasets over the current state-of-the-art models.

== Critques ==
The authors chose to use the same architecture for every dataset. The model does not achieve state-of-the-art performance on the larger datasets. Perhaps, a more complex model in the second part of the model could help achieve better results. The authors also put a lot of effort in explaining the model complexity for APLC-XLNet but does not compare it other state-of-the-art models. A table of model parameters and complexity for each model could be helpful in explaining why their techniques are efficient.

== References ==
[1] Mikolov, T., Kombrink, S., Burget, L., Cernock ˇ y, J., and `
Khudanpur, S. Extensions of recurrent neural network
language model. In 2011 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP),
pp. 5528–5531. IEEE, 2011.

[2] Dekel, O. and Shamir, O. Multiclass-multilabel classification with more classes than examples. In Proceedings
of the Thirteenth International Conference on Artificial
Intelligence and Statistics, pp. 137–144, 2010.

[3] Jain, H., Prabhu, Y., and Varma, M. Extreme multi-label loss
functions for recommendation, tagging, ranking & other
missing label applications. In Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 935–944. ACM, 2016.

[4] Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li,
M., and Lin, J. End-to-end open-domain question answering with BERTserini. In NAACL-HLT (Demonstrations),
2019a.

[5] Jasinska, K., Dembczynski, K., Busa-Fekete, R.,
Pfannschmidt, K., Klerx, T., and Hullermeier, E. Extreme f-measure maximization using sparse probability
estimates. In International Conference on Machine Learning, pp. 1435–1444, 2016.

[6] Grave, E., Joulin, A., Cisse, M., J ´ egou, H., et al. Effi- ´
cient softmax approximation for gpus. In Proceedings of
the 34th International Conference on Machine LearningVolume 70, pp. 1302–1310. JMLR.org, 2017

[7] Wei-Cheng, C., Hsiang-Fu, Y., Kai, Z., Yiming, Y., and
Inderjit, D. X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder Representations from
Transformers. In NeurIPS Science Meets Engineering of
Deep Learning Workshop, 2019.

[8] Howard, J. and Ruder, S. Universal language model finetuning for text classification. In Proceedings of the 56th
Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pp. 328–339, 2018

File:Capture1111.JPG

2020-11-24T11:55:51Z

A4moayye:

DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION

2020-11-24T11:25:11Z

A4moayye:

== Presented by ==
Bowen You

== Introduction ==

Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised and unsupervised learning, and it refers to training a neural network to make a series of decisions dependent on a complex, evolving environment. Typically, this is accomplished by 'rewarding' or 'penalizing' the network based on its behaviors over time. Intelligent agents are able to accomplish tasks that may not have been seen in prior experiences. For recent reviews of reinforcement learning, see [3,4]. One way to achieve this is to represent the world based on past experiences. In this paper, the authors propose an agent that learns long-horizon behaviors purely by latent imagination and outperforms previous agents in terms of data efficiency, computation time, and final performance. The code for this paper is freely available at https://github.com/google-research/dreamer.

=== Preliminaries ===

This section aims to define a few key concepts in reinforcement learning. In the typical reinforcement problem, an <b>agent</b> interacts with the <b>environment</b>. The environment is typically defined by a <b>model</b> that may or may not be known. The environment may be characterized by its <b>state</b> <math display="inline"> s \in \mathcal{S}</math>. The agent may choose to take <b>actions</b> <math display="inline"> a \in \mathcal{A}</math> to interact with the environment. Once an action is taken, the environment returns a <b>reward</b> <math display="inline"> r \in \mathcal{R}</math> as feedback.

The actions an agent decides to take is defined by a <b>policy</b> function <math display="inline"> \pi : \mathcal{S} \to \mathcal{A}</math>.
Additionally we define functions <math display="inline"> V_{\pi} : \mathcal{S} \to \mathbb{R} \in \mathcal{S}</math> and <math display="inline"> Q_{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}</math> to represent the value function and action-value functions of a given policy <math display="inline">\pi</math> respectively. Informally, <math>V_{\pi}</math> tells one how good a state is in terms of the expected return when starting in the state <math>s</math> and then following the policy <math>\pi</math>. Similarly <math>Q_{\pi}</math> gives the value of the expected return starting from the state <math>s</math>, taking the action <math>a</math>, and subsequently following the policy <math>\pi</math>.

Thus the goal is to find an optimal policy <math display="inline">\pi_{*}</math> such that
\[
\pi_{*} = \arg\max_{\pi} V_{\pi}(s) = \arg\max_{\pi} Q_{\pi}(s, a)
\]

=== Feedback Loop ===

Given this framework, agents are able to interact with the environment in a sequential fashion, namely a sequence of actions, states, and rewards. Let <math display="inline"> S_t, A_t, R_t</math> denote the state, action, and reward obtained at time <math display="inline"> t = 1, 2, \ldots, T</math>. We call the tuple <math display="inline">(S_t, A_t, R_t)</math> one <b>episode</b>. This can be thought of as a feedback loop or a sequence
\[
S_1, A_1, R_1, S_2, A_2, R_2, \ldots, S_T
\]

== Motivation ==

In many problems, the amount of actions an agent is able to take is limited. Then it is difficult to interact with the environment to learn an accurate representation of the world. The proposed method in this paper aims to solve this problem by "imagining" the state and reward that the action will provide. That is, given a state <math display="inline">S_t</math>, the proposed method generates
\[
\hat{A}_t, \hat{R}_t, \hat{S}_{t+1}, \ldots
\]

By doing this, an agent is able to plan-ahead and perceive a representation of the environment without interacting with it. Once an action is made, the agent is able to update their representation of the world by the actual observation. This is particularly useful in applications where experience is not easily obtained.

== Dreamer ==

The authors of the paper call their method Dreamer. In a high-level perspective, Dreamer first learns latent dynamics from past experience, then it learns actions and states from imagined trajectories to maximize future action rewards. Finally, it predicts the next action and executes it. This whole process is illustrated below.

[[File: dreamer_overview.png | 800px]]

Let's look at Dreamer in detail. It consists of:
* Representation <math display="inline">p_{\theta}(s_t | s_{t-1}, a_{t-1}, o_{t}) </math>
* Transition <math display="inline">q_{\theta}(s_t | s_{t-1}, a_{t-1}) </math>
* Reward <math display="inline"> q_{\theta}(r_t | s_t)</math>
* Action <math display="inline"> q_{\phi}(a_t | s_t)</math>
* Value <math display="inline"> v_{\psi}(s_t)</math>

where <math>o_{t}</math> is the observation at time <math>t</math> and <math display="inline"> \theta, \phi, \psi</math> are learned neural network parameters.

There are three main components to the proposed algorithm:
* Dynamics Learning: Using past experience data, the agent learns to encode observations and actions into latent states and predicts environment rewards. One way to do this is via representation learning.
* Behavior Learning: In the latent space, the agent predicts state values and actions that maximize the future rewards through back-propagation.
* Environment Interaction: The agent encodes the episode to compute the current model state and predict the next action to interact with the environment.

The proposed algorithm is described below.

[[File:dreamer.png|frameless|500px|Dreamer algorithm]]

Notice that there are three neural networks that are trained simultaneously.
The neural networks with parameters <math display="inline"> \theta, \phi, \psi </math> correspond to models of the environment, action and values respectively.

== Results ==

The figure below summarizes the performance of Dreamer compared to other state-of-the-art reinforcement learning agents for continuous control tasks. Using the same hyper parameters for all tasks, Dreamer exceeds previous model-based and model-free agents in terms of data-efficiency, computation time, and final performance and overall, it achieves the most consistent performance among them. Additionally, while other agents heavily rely on prior experience, Dreamer is able to learn behaviors with minimal interactions with the environment.

[[File:scores.png|frameless|500px|Comparison of RL-agents against several continuous control tasks]]

== Conclusion ==

This paper presented a new algorithm for training reinforcement learning agents with minimal interactions with the environment. The algorithm outperforms many previous algorithms in terms of computation time and overall performance. This has many practical applications as many agents rely on prior experience which may be hard to obtain in the real-world. Although it may be an extreme example, consider a reinforcement learning agent that learns how to perform rare surgeries may not have enough data samples. This paper shows that it is possible to train agents without requiring many prior interactions with the environment. Also, as a future work on representation learning, the ability to scale latent imagination to environments of higher visual complexity can be investigated.

== Critique ==
This paper presents an approach that involves learning a latent dynamics model to learn 20 visual control tasks.

In the model components in Appendix A, they have mentioned that "three dense layers of size 300 with ELU activations" and "30-dimensional diagonal Gaussians" have been used for distributions in latent space. The paper would have benefitted from pointing out how come they have come up with this architecture as their model. In other words, how the latent vector determines the performance of the agent.

== References ==

[1] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020.

[2] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.

[3] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.

[4] Nian, R., Liu, J., & Huang, B. (2020). A review On reinforcement learning: Introduction and applications in industrial process control. Computers and Chemical Engineering, 139, 106886.

DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION

2020-11-24T11:15:30Z

A4moayye:

== Presented by ==
Bowen You

== Introduction ==

Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised and unsupervised learning, and it refers to training a neural network to make a series of decisions dependent on a complex, evolving environment. Typically, this is accomplished by 'rewarding' or 'penalizing' the network based on its behaviors over time. Intelligent agents are able to accomplish tasks that may not have been seen in prior experiences. For recent reviews of reinforcement learning, see [3,4]. One way to achieve this is to represent the world based on past experiences. In this paper, the authors propose an agent that learns long-horizon behaviors purely by latent imagination and outperforms previous agents in terms of data efficiency, computation time, and final performance. The code for this paper is freely available at https://github.com/google-research/dreamer.

=== Preliminaries ===

This section aims to define a few key concepts in reinforcement learning. In the typical reinforcement problem, an <b>agent</b> interacts with the <b>environment</b>. The environment is typically defined by a <b>model</b> that may or may not be known. The environment may be characterized by its <b>state</b> <math display="inline"> s \in \mathcal{S}</math>. The agent may choose to take <b>actions</b> <math display="inline"> a \in \mathcal{A}</math> to interact with the environment. Once an action is taken, the environment returns a <b>reward</b> <math display="inline"> r \in \mathcal{R}</math> as feedback.

The actions an agent decides to take is defined by a <b>policy</b> function <math display="inline"> \pi : \mathcal{S} \to \mathcal{A}</math>.
Additionally we define functions <math display="inline"> V_{\pi} : \mathcal{S} \to \mathbb{R} \in \mathcal{S}</math> and <math display="inline"> Q_{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}</math> to represent the value function and action-value functions of a given policy <math display="inline">\pi</math> respectively. Informally, <math>V_{\pi}</math> tells one how good a state is in terms of the expected return when starting in the state <math>s</math> and then following the policy <math>\pi</math>. Similarly <math>Q_{\pi}</math> gives the value of the expected return starting from the state <math>s</math>, taking the action <math>a</math>, and subsequently following the policy <math>\pi</math>.

Thus the goal is to find an optimal policy <math display="inline">\pi_{*}</math> such that
\[
\pi_{*} = \arg\max_{\pi} V_{\pi}(s) = \arg\max_{\pi} Q_{\pi}(s, a)
\]

=== Feedback Loop ===

Given this framework, agents are able to interact with the environment in a sequential fashion, namely a sequence of actions, states, and rewards. Let <math display="inline"> S_t, A_t, R_t</math> denote the state, action, and reward obtained at time <math display="inline"> t = 1, 2, \ldots, T</math>. We call the tuple <math display="inline">(S_t, A_t, R_t)</math> one <b>episode</b>. This can be thought of as a feedback loop or a sequence
\[
S_1, A_1, R_1, S_2, A_2, R_2, \ldots, S_T
\]

== Motivation ==

In many problems, the amount of actions an agent is able to take is limited. Then it is difficult to interact with the environment to learn an accurate representation of the world. The proposed method in this paper aims to solve this problem by "imagining" the state and reward that the action will provide. That is, given a state <math display="inline">S_t</math>, the proposed method generates
\[
\hat{A}_t, \hat{R}_t, \hat{S}_{t+1}, \ldots
\]

By doing this, an agent is able to plan-ahead and perceive a representation of the environment without interacting with it. Once an action is made, the agent is able to update their representation of the world by the actual observation. This is particularly useful in applications where experience is not easily obtained.

== Dreamer ==

The authors of the paper call their method Dreamer. In a high-level perspective, Dreamer first learns latent dynamics from past experience, then it learns actions and states from imagined trajectories to maximize future action rewards. Finally, it predicts the next action and executes it. This whole process is illustrated below.

[[File: dreamer_overview.png | 800px]]

Let's look at Dreamer in detail. It consists of:
* Representation <math display="inline">p_{\theta}(s_t | s_{t-1}, a_{t-1}, o_{t}) </math>
* Transition <math display="inline">q_{\theta}(s_t | s_{t-1}, a_{t-1}) </math>
* Reward <math display="inline"> q_{\theta}(r_t | s_t)</math>
* Action <math display="inline"> q_{\phi}(a_t | s_t)</math>
* Value <math display="inline"> v_{\psi}(s_t)</math>

where <math>o_{t}</math> is the observation at time <math>t</math> and <math display="inline"> \theta, \phi, \psi</math> are learned neural network parameters.

There are three main components to the proposed algorithm:
* Dynamics Learning: Using past experience data, the agent learns to encode observations and actions into latent states and predicts environment rewards. One way to do this is via representation learning.
* Behavior Learning: In the latent space, the agent predicts state values and actions that maximize the future rewards through back-propagation.
* Environment Interaction: The agent encodes the episode to compute the current model state and predict the next action to interact with the environment.

The proposed algorithm is described below.

[[File:dreamer.png|frameless|500px|Dreamer algorithm]]

Notice that there are three neural networks that are trained simultaneously.
The neural networks with parameters <math display="inline"> \theta, \phi, \psi </math> correspond to models of the environment, action and values respectively.

== Results ==

The figure below summarizes the performance of Dreamer compared to other state-of-the-art reinforcement learning agents for continuous control tasks. Overall, it achieves the most consistent performance among them. Additionally, while other agents heavily rely on prior experience, Dreamer is able to learn behaviors with minimal interactions with the environment.

[[File:scores.png|frameless|500px|Comparison of RL-agents against several continuous control tasks]]

== Conclusion ==

This paper presented a new algorithm for training reinforcement learning agents with minimal interactions with the environment. The algorithm outperforms many previous algorithms in terms of computation time and overall performance. This has many practical applications as many agents rely on prior experience which may be hard to obtain in the real-world. Although it may be an extreme example, consider a reinforcement learning agent that learns how to perform rare surgeries may not have enough data samples. This paper shows that it is possible to train agents without requiring many prior interactions with the environment. Also, as a future work on representation learning, the ability to scale latent imagination to environments of higher visual complexity can be investigated.

== Critique ==
This paper presents an approach that involves learning a latent dynamics model to learn 20 visual control tasks.

In the model components in Appendix A, they have mentioned that "three dense layers of size 300 with ELU activations" and "30-dimensional diagonal Gaussians" have been used for distributions in latent space. The paper would have benefitted from pointing out how come they have come up with this architecture as their model. In other words, how the latent vector determines the performance of the agent.

== References ==

[1] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020.

[2] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.

[3] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.

[4] Nian, R., Liu, J., & Huang, B. (2020). A review On reinforcement learning: Introduction and applications in industrial process control. Computers and Chemical Engineering, 139, 106886.

Meta-Learning For Domain Generalization

2020-11-21T12:56:35Z

A4moayye: /* Illustrative Synthetic Experiment */

== Presented by ==
Parsa Ashrafi Fashi

== Introduction ==

Domain Shift problem addresses the problem where a model trained on a data distribution cannot perform well when tested on another domain with a different distribution. Domain Generalization tries to tackle this problem by producing models that can perform well on unseen target domains. Several approaches have been adapted for the problem, such as training a model for each source domain, extracting a domain agnostic representation, and semantic feature learning. Meta-Learning and specifically Model-Agnostic Meta-Learning models, which have been widely adopted recently, are models capable of adapting or generalizing to new tasks and new environments that have never been encountered during training time. Meta-learning is also known as "learning to learn". It aims to enable intelligent agents to take the principles they learned in one domain and apply them to other domains. One concrete meta-learning task is to create a game bot that can quickly master a new game. Hereby defining tasks as domains, the paper tries to overcome the problem in a model-agnostic way.

== Previous Work ==
There were 3 common approaches to Domain Generalization. The simplest way is to train a model for each source domain and estimate which model performs better on a new unseen target domain [1]. A second approach is to presume that any domain is composed of a domain-agnostic and a domain-specific component. By factoring out the domain-specific and domain-agnostic components during training on source domains, the domain-agnostic component can be extracted and transferred as a model that is likely to work on a new source domain [2]. Finally, a domain-invariant feature representation is learned to minimize the gap between multiple source domains and it should provide a domain-independent representation that performs well on a new target domain [3][4][5].

== Method ==
In the DG setting, we assume there are S source domains <math> S </math> and T target domains <math> T </math> . We define a single model parametrized as <math> \theta </math> to solve the specified task. DG aims for training <math> \theta </math> on the source domains, such that it generalizes to the target domains. At each learning iteration we split the original S source domains <math> S </math> into S−V meta-train domains <math> \bar{S} </math> and V meta-test domains <math> \breve{S} </math> (virtual-test domain). This is to mimic real train-test domain-shifts so that over many iterations we can train a model to achieve good generalization in the final-test evaluated on target domains <math>T</math> .

The paper explains the method based on two approaches; Supervised Learning and Reinforcement Learning.

=== Supervised Learning ===

First, <math> l(\hat{y},y) </math> is defined as a cross-entropy loss function. ( <math> l(\hat{y},y) = -\hat{y}log(y) </math>). The process is as follows.

==== Meta-Train ====
The model is updated on S-V domains <math> \bar{S} </math> and the loss function is defined as: <math> F(.) = \frac{1}{S-V} \sum\limits_{i=1}^{S-V} \frac {1}{N_i} \sum\limits_{j=1}^{N_i} l_{\theta}(\hat{y}_j^{(i)}, y_j^{(i)})</math>

In this step the model is optimized by gradient descent like follows: <math> \theta^{\prime} = \theta - \alpha \nabla_{\theta} </math>

==== Meta-Test ====

In each mini-batch the model is also virtually evaluated on the V meta-test domains <math>\breve{S}</math>. This meta-test evaluation simulates testing on new domains with different statistics, in order to allow learning to generalize across domains. The loss for the adapted parameters calculated on the meta-test domains is as follows: <math> G(.) = \frac{1}{V} \sum\limits_{i=1}^{V} \frac {1}{N_i} \sum\limits_{j=1}^{N_i} l_{\theta^{\prime}}(\hat{y}_j^{(i)}, y_j^{(i)})</math>

The loss on the meta-test domain is calculated using the updated parameters <math>\theta' </math> from meta-train. This means that for optimization with respect to <math>G </math> we will need the second derivative with respect to <math>\theta </math>.

==== Final Objective Function ====

Combining the two loss functions, the final objective function is as follows: <math> argmin_{\theta} \; F(\theta) + \beta G(\theta - \alpha F^{\prime}(\theta)) </math>, where <math>\beta</math> represents how much meta-test weighs. Algorithm 1 illustrates the supervised learning approach.

[[File:ashraf1.jpg |center|600px]]

<div align="center">Algorithm 1: MLDG Supervised Learning Approach.</div>

=== Reinforcement Learning ===

In application to the reinforcement learning (RL) setting, we now assume an agent with a policy <math> \pi </math> that inputs states <math> s </math> and produces actions <math> a </math> in a sequential decision making task: <math>a_t = \pi_{\theta}(s_t)</math>. The agent operates in an environment and its goal is to maximize its discounted return, <math> R = \sum\limits_{t} \delta^t R_t(s_t, a_t) </math> where <math> R_t </math> is the reward obtained at timestep <math> t </math> under policy <math> \pi </math> and <math> \delta </math> is the discount factor. What we have in supervised learning as tasks map to reward functions here and domains map to solving the same task (reward function) in a different environments. Therefore, domain generalization achieves an agent that is able to perform well even at new environments without any initial learning.
==== Meta-Train ====
In meta-training, the loss function <math> F(·) </math>now corresponds to the negative discounted return <math> -R </math> of policy <math> \pi_{\theta} </math>, averaged over all the meta-training environments in <math> \bar{S} </math>. That is,
\begin{align}
F = \frac{1}{|\bar{S}|} \sum_{s \in \bar{S}} -R_s
\end{align}

Then the optimal policy is obtained by minimizing <math> F </math>.

==== Meta-Test ====
The step is like a meta-test of supervised learning and loss is again negative of return function. For RL calculating this loss requires rolling out the meta-train updated policy <math> \theta' </math> in the meta-test domains to collect new trajectories and rewards. The reinforcement learning approach is also illustrated completely in algorithm 2.
[[File:ashraf2.jpg |center|600px]]

<div align="center">Algorithm 1: MLDG Reinforcement Learning Approach.</div>

==== Alternative Variants of MLDG ====
The authors propose different variants of MLDG objective function. For example the so-called MLDG-GC is one that normalizes the gradients upon update to compute the cosine similarity. It is given by:
\begin{equation}
\text{argmin}_\theta F(\theta) + \beta G(\theta) - \beta \alpha \frac{F'(\theta) \cdot G'(\theta)}{||F'(\theta)||_2 ||G'(\theta)||_2}.
\end{equation}

Another one stops the update of the parameters after the meta-train has converged. This intuition gives the following objective function called MLDG-GN:
\begin{equation}
\text{argmin}_\theta F(\theta) - \beta ||G'(\theta) - \alpha F'(\theta)||_2^2
\end{equation}.

== Experiments ==

The Proposed method is exploited in 4 different experiment results (2 supervised and 2 reinforcement learning experiments).

=== Illustrative Synthetic Experiment ===

In this experiment, nine domains by sampling curved deviations are synthesized from a diagonal line classifier. We treat eight of these as sources for meta-learning and hold out the last for the final test. Fig. 1 shows the nine synthetic domains which are related in form but differ in the details of their decision boundary. The results show that MLDG performs near perfect and the baseline model without considering domains overfits in the bottom left corner. The baselines for this experiment, as can be seen in Fig. 1, were MLP-All, MLDG, MLDG-GC, and MLDG-GN.

[[File:ashraf3.jpg |center|600px]]

<div align="center">Figure 1: Synthetic experiment illustrating MLDG.</div>

=== Object Detection ===
For object detection, the PACS multi-domain recognition benchmark is exploited; a dataset designed for the cross-domain recognition problems. This dataset has 7 categories (‘dog’, ‘elephant’, ‘giraffe’, ‘guitar’, ‘house’, ‘horse’ and ‘person’) and 4 domains of different stylistic depictions (‘Photo’, ‘Art painting’, ‘Cartoon’ and ‘Sketch’). The diverse depiction styles provide a significant domain gap. The Result of the Current approach compared to other approaches is presented in Table 1. The baseline models are D-MTAE[5],Deep-All (Vanilla AlexNet)[2], DSN[6]and AlexNet+TF[2]. On average, the proposed method outperforms other methods.

[[File:ashraf4.jpg |center|800px]]

<div align="center">Table 1: Cross-domain recognition accuracy (Multi-class accuracy) on the PACS dataset. Best performance in bold. </div>

=== Cartpole ===

The objective is to balance a pole upright by moving a cart. The action space is discrete – left or right. The state has four elements: the position and velocity of the cart and the angular position and velocity of the pole. There are two sub-experiments designed. In the first one, the domain factor is varied by changing the pole length. They simulate 9 domains with pole lengths. In the second they vary multiple domain factors – pole length and cart mass. In both experiments, we randomly choose 6 source domains for training and hold out 3 domains for (true) testing. Since the game can last forever, if the pole does not fall, we cap the maximum steps to 200. The result of both experiments is presented in Tables 2 and 3. The baseline methods are RL-All (Trains a single policy by aggregating the reward from all six source domains) RL-Random-Source (trains on a single randomly selected source domain) and RL-undo-bias: Adaptation of the (linear) undo-bias model of [7]. The proposed MLDG outperforms the baselines.

[[File:ashraf5.jpg |center|800px]]

<div align="center">Table 2: Cart-Pole RL. Domain generalisation performance across pole length. Average reward testing on 3 held out domains with random lengths. Upper bound: 200. </div>

[[File:ashraf5.jpg |center|800px]]

<div align="center">Table 3: Cart-Pole RL. Generalization performance across both pole length and cart mass. Return testing on 3 held out domains with random length and mass. Upper bound: 200. </div>

=== Mountain Car ===

In this classic RL problem, a car is positioned between two mountains, and the agent needs to drive the car so that it can hit the peak of the right mountain. The difficulty of this problem is that the car engine is not strong enough to drive up the right mountain directly. The agent has to figure out a solution of driving up the left mountain to first generate momentum before driving up the right mountain. The state observation in this game consists of two elements: the position and velocity of the car. There are three available actions: drive left, do nothing, and drive right. Here the baselines are the same as Cartpole. The model doesn't outperform the RL-undo-bias but has a close return value. The results are shown in Table 4.

[[File:ashraf7.jpg |center|800px]]

<div align="center">Table 4: Domain generalisation performance for mountain car. Failure rate (↓) and reward (↑) on held-out testing domains with random mountain heights. </div>

== Conclusion ==

This paper proposed a model-agnostic approach to domain generalization. Unlike prior model-based domain generalization approaches, it scales well with the number of domains and it can also be applied to different Neural Network models. Experimental evaluation shows state-of-the-art results on a recent challenging visual recognition benchmark and promising results on multiple classic RL problems.

== Critiques ==

I believe that the meta-learning-based approach (MLDG) extending MAML to the domain generalization problem might have some limitation problems. The objective function of MAML is more applicable for fast task adaptation even it can be shown from the presented tasks in the paper. Also, in the generalization, we do not have access to samples from a new domain, so the MAML-like objective might lead to sub-optimal, as it is highly abstracted from the feature representations. In addition to this, it is hard to scale MLDG to deep architectures like Resnet as it requires differentiating through k iterations of optimization updates, which will lead to some limitations, so I would believe it will be more effective in task networks as it is much shallower than the feature networks.

Why meta-learning makes the domain generalization to be domain agnostic?

In the case that we have four domains, do we randomly pick two domains for meta-train and one for meta-test? if affirmative, because we select two domains out of the three for the meta train, it is likely to have similar meta-train domains between episodes, right?

The paper would have benefited from demonstrating the strength of the MLDG in terms of embedding space in lower dimensions (TSNE, UMAP) for PACS and other datasets. It is unclear how well the algorithm would have performed domain agnostically on these datasets.

== References ==

[1]: [Xu et al. 2014] Xu, Z.; Li, W.; Niu, L.; and Xu, D. 2014. Exploiting low-rank structure from latent domains for domain generalization. In ECCV.

[2]: [Li et al. 2017] Li, D.; Yang, Y.; Song, Y.-Z.; and Hospedales, T. 2017. Deeper, broader, and artier domain generalization. In ICCV.

[3]: [Muandet, Balduzzi, and Scholkopf 2013] ¨ Muandet, K.; Balduzzi, D.; and Scholkopf, B. 2013. Domain generalization via invariant feature representation. In ICML.

[4]: [Ganin and Lempitsky 2015] Ganin, Y., and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation. In ICML.

[5]: [Ghifary et al. 2015] Ghifary, M.; Bastiaan Kleijn, W.; Zhang, M.; and Balduzzi, D. 2015. Domain generalization for object recognition with multi-task autoencoders. In ICCV.

[6]: [Bousmalis et al. 2016] Bousmalis, K.; Trigeorgis, G.; Silberman, N.; Krishnan, D.; and Erhan, D. 2016. Domain separation networks. In NIPS.

[7]: [Khosla et al. 2012] Khosla, A.; Zhou, T.; Malisiewicz, T.; Efros, A. A.; and Torralba, A. 2012. Undoing the damage of dataset bias. In ECCV.

Meta-Learning For Domain Generalization

2020-11-20T22:44:47Z

A4moayye: /* Reinforcement Learning */

== Presented by ==
Parsa Ashrafi Fashi

== Introduction ==

Domain Shift problem addresses the problem where a model trained on a data distribution cannot perform well when tested on another domain with a different distribution. Domain Generalization tries to tackle this problem by producing models that can perform well on unseen target domains. Several approaches have been adapted for the problem, such as training a model for each source domain, extracting a domain agnostic representation, and semantic feature learning. Meta-Learning and specifically Model-Agnostic Meta-Learning models, which have been widely adopted recently, are models capable of adapting or generalizing to new tasks and new environments that have never been encountered during training time. Meta-learning is also known as "learning to learn". It aims to enable intelligent agents to take the principles they learned in one domain and apply them to other domains. One concrete meta-learning task is to create a game bot that can quickly master a new game. Hereby defining tasks as domains, the paper tries to overcome the problem in a model-agnostic way.

== Previous Work ==
There were 3 common approaches to Domain Generalization. The simplest way is to train a model for each source domain and estimate which model performs better on a new unseen target domain [1]. A second approach is to presume that any domain is composed of a domain-agnostic and a domain-specific component. By factoring out the domain-specific and domain-agnostic components during training on source domains, the domain-agnostic component can be extracted and transferred as a model that is likely to work on a new source domain [2]. Finally, a domain-invariant feature representation is learned to minimize the gap between multiple source domains and it should provide a domain-independent representation that performs well on a new target domain [3][4][5].

== Method ==
In the DG setting, we assume there are S source domains <math> S </math> and T target domains <math> T </math> . We define a single model parametrized as <math> \theta </math> to solve the specified task. DG aims for training <math> \theta </math> on the source domains, such that it generalizes to the target domains. At each learning iteration we split the original S source domains <math> S </math> into S−V meta-train domains <math> \bar{S} </math> and V meta-test domains <math> \breve{S} </math> (virtual-test domain). This is to mimic real train-test domain-shifts so that over many iterations we can train a model to achieve good generalization in the final-test evaluated on target domains <math>T</math> .

The paper explains the method based on two approaches; Supervised Learning and Reinforcement Learning.

=== Supervised Learning ===

First, <math> l(\hat{y},y) </math> is defined as a cross-entropy loss function. ( <math> l(\hat{y},y) = -\hat{y}log(y) </math>). The process is as follows.

==== Meta-Train ====
The model is updated on S-V domains <math> \bar{S} </math> and the loss function is defined as: <math> F(.) = \frac{1}{S-V} \sum\limits_{i=1}^{S-V} \frac {1}{N_i} \sum\limits_{j=1}^{N_i} l_{\theta}(\hat{y}_j^{(i)}, y_j^{(i)})</math>

In this step the model is optimized by gradient descent like follows: <math> \theta^{\prime} = \theta - \alpha \nabla_{\theta} </math>

==== Meta-Test ====

In each mini-batch the model is also virtually evaluated on the V meta-test domains <math>\breve{S}</math>. This meta-test evaluation simulates testing on new domains with different statistics, in order to allow learning to generalize across domains. The loss for the adapted parameters calculated on the meta-test domains is as follows: <math> G(.) = \frac{1}{V} \sum\limits_{i=1}^{V} \frac {1}{N_i} \sum\limits_{j=1}^{N_i} l_{\theta^{\prime}}(\hat{y}_j^{(i)}, y_j^{(i)})</math>

The loss on the meta-test domain is calculated using the updated parameters <math>\theta' </math> from meta-train. This means that for optimization with respect to <math>G </math> we will need the second derivative with respect to <math>\theta </math>.

==== Final Objective Function ====

Combining the two loss functions, the final objective function is as follows: <math> argmin_{\theta} \; F(\theta) + \beta G(\theta - \alpha F^{\prime}(\theta)) </math>, where <math>\beta</math> represents how much meta-test weighs. Algorithm 1 illustrates the supervised learning approach.

[[File:ashraf1.jpg |center|600px]]

<div align="center">Algorithm 1: MLDG Supervised Learning Approach.</div>

=== Reinforcement Learning ===

In application to the reinforcement learning (RL) setting, we now assume an agent with a policy <math> \pi </math> that inputs states <math> s </math> and produces actions <math> a </math> in a sequential decision making task: <math>a_t = \pi_{\theta}(s_t)</math>. The agent operates in an environment and its goal is to maximize its discounted return, <math> R = \sum\limits_{t} \delta^t R_t(s_t, a_t) </math> where <math> R_t </math> is the reward obtained at timestep <math> t </math> under policy <math> \pi </math> and <math> \delta </math> is the discount factor. What we have in supervised learning as tasks map to reward functions here and domains map to solving the same task (reward function) in a different environments. Therefore, domain generalization achieves an agent that is able to perform well even at new environments without any initial learning.
==== Meta-Train ====
In meta-training, the loss function <math> F(·) </math>now corresponds to the negative discounted return <math> -R </math> of policy <math> \pi_{\theta} </math>, averaged over all the meta-training environments in <math> \bar{S} </math>. That is,
\begin{align}
F = \frac{1}{|\bar{S}|} \sum_{s \in \bar{S}} -R_s
\end{align}

Then the optimal policy is obtained by minimizing <math> F </math>.

==== Meta-Test ====
The step is like a meta-test of supervised learning and loss is again negative of return function. For RL calculating this loss requires rolling out the meta-train updated policy <math> \theta' </math> in the meta-test domains to collect new trajectories and rewards. The reinforcement learning approach is also illustrated completely in algorithm 2.
[[File:ashraf2.jpg |center|600px]]

<div align="center">Algorithm 1: MLDG Reinforcement Learning Approach.</div>

==== Alternative Variants of MLDG ====
The authors propose different variants of MLDG objective function. For example the so-called MLDG-GC is one that normalizes the gradients upon update to compute the cosine similarity. It is given by:
\begin{equation}
\text{argmin}_\theta F(\theta) + \beta G(\theta) - \beta \alpha \frac{F'(\theta) \cdot G'(\theta)}{||F'(\theta)||_2 ||G'(\theta)||_2}.
\end{equation}

Another one stops the update of the parameters after the meta-train has converged. This intuition gives the following objective function called MLDG-GN:
\begin{equation}
\text{argmin}_\theta F(\theta) - \beta ||G'(\theta) - \alpha F'(\theta)||_2^2
\end{equation}.

== Experiments ==

The Proposed method is exploited in 4 different experiment results (2 supervised and 2 reinforcement learning experiments).

=== Illustrative Synthetic Experiment ===

In this experiment, nine domains by sampling curved deviations are synthesized from a diagonal line classifier. We treat eight of these as sources for meta-learning and hold out the last for the final test. Fig. 1 shows the nine synthetic domains which are related in form but differ in the details of their decision boundary. The results show that MLDG performs near perfect and the baseline model without considering domains overfits in the bottom left corner.

[[File:ashraf3.jpg |center|600px]]

<div align="center">Figure 1: Synthetic experiment illustrating MLDG.</div>

=== Object Detection ===
For object detection, the PACS multi-domain recognition benchmark is exploited; a dataset designed for the cross-domain recognition problems. This dataset has 7 categories (‘dog’, ‘elephant’, ‘giraffe’, ‘guitar’, ‘house’, ‘horse’ and ‘person’) and 4 domains of different stylistic depictions (‘Photo’, ‘Art painting’, ‘Cartoon’ and ‘Sketch’). The diverse depiction styles provide a significant domain gap. The Result of the Current approach compared to other approaches is presented in Table 1. The baseline models are D-MTAE[5],Deep-All (Vanilla AlexNet)[2], DSN[6]and AlexNet+TF[2]. On average, the proposed method outperforms other methods.

[[File:ashraf4.jpg |center|800px]]

<div align="center">Table 1: Cross-domain recognition accuracy (Multi-class accuracy) on the PACS dataset. Best performance in bold. </div>

=== Cartpole ===

The objective is to balance a pole upright by moving a cart. The action space is discrete – left or right. The state has four elements: the position and velocity of the cart and the angular position and velocity of the pole. There are two sub-experiments designed. In the first one, the domain factor is varied by changing the pole length. They simulate 9 domains with pole lengths. In the second they vary multiple domain factors – pole length and cart mass. In both experiments, we randomly choose 6 source domains for training and hold out 3 domains for (true) testing. Since the game can last forever, if the pole does not fall, we cap the maximum steps to 200. The result of both experiments is presented in Tables 2 and 3. The baseline methods are RL-All (Trains a single policy by aggregating the reward from all six source domains) RL-Random-Source (trains on a single randomly selected source domain) and RL-undo-bias: Adaptation of the (linear) undo-bias model of [7]. The proposed MLDG outperforms the baselines.

[[File:ashraf5.jpg |center|800px]]

<div align="center">Table 2: Cart-Pole RL. Domain generalisation performance across pole length. Average reward testing on 3 held out domains with random lengths. Upper bound: 200. </div>

[[File:ashraf5.jpg |center|800px]]

<div align="center">Table 3: Cart-Pole RL. Generalization performance across both pole length and cart mass. Return testing on 3 held out domains with random length and mass. Upper bound: 200. </div>

=== Mountain Car ===

In this classic RL problem, a car is positioned between two mountains, and the agent needs to drive the car so that it can hit the peak of the right mountain. The difficulty of this problem is that the car engine is not strong enough to drive up the right mountain directly. The agent has to figure out a solution of driving up the left mountain to first generate momentum before driving up the right mountain. The state observation in this game consists of two elements: the position and velocity of the car. There are three available actions: drive left, do nothing, and drive right. Here the baselines are the same as Cartpole. The model doesn't outperform the RL-undo-bias but has a close return value. The results are shown in Table 4.

[[File:ashraf7.jpg |center|800px]]

<div align="center">Table 4: Domain generalisation performance for mountain car. Failure rate (↓) and reward (↑) on held-out testing domains with random mountain heights. </div>

== Conclusion ==

This paper proposed a model-agnostic approach to domain generalization. Unlike prior model-based domain generalization approaches, it scales well with the number of domains and it can also be applied to different Neural Network models. Experimental evaluation shows state-of-the-art results on a recent challenging visual recognition benchmark and promising results on multiple classic RL problems.

== Critiques ==

I believe that the meta-learning-based approach (MLDG) extending MAML to the domain generalization problem might have some limitation problems. The objective function of MAML is more applicable for fast task adaptation even it can be shown from the presented tasks in the paper. Also, in the generalization, we do not have access to samples from a new domain, so the MAML-like objective might lead to sub-optimal, as it is highly abstracted from the feature representations. In addition to this, it is hard to scale MLDG to deep architectures like Resnet as it requires differentiating through k iterations of optimization updates, which will lead to some limitations, so I would believe it will be more effective in task networks as it is much shallower than the feature networks.

Why meta-learning makes the domain generalization to be domain agnostic?

In the case that we have four domains, do we randomly pick two domains for meta-train and one for meta-test? if affirmative, because we select two domains out of the three for the meta train, it is likely to have similar meta-train domains between episodes, right?

The paper would have benefited from demonstrating the strength of the MLDG in terms of embedding space in lower dimensions (TSNE, UMAP) for PACS and other datasets. It is unclear how well the algorithm would have performed domain agnostically on these datasets.

== References ==

[1]: [Xu et al. 2014] Xu, Z.; Li, W.; Niu, L.; and Xu, D. 2014. Exploiting low-rank structure from latent domains for domain generalization. In ECCV.

[2]: [Li et al. 2017] Li, D.; Yang, Y.; Song, Y.-Z.; and Hospedales, T. 2017. Deeper, broader, and artier domain generalization. In ICCV.

[3]: [Muandet, Balduzzi, and Scholkopf 2013] ¨ Muandet, K.; Balduzzi, D.; and Scholkopf, B. 2013. Domain generalization via invariant feature representation. In ICML.

[4]: [Ganin and Lempitsky 2015] Ganin, Y., and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation. In ICML.

[5]: [Ghifary et al. 2015] Ghifary, M.; Bastiaan Kleijn, W.; Zhang, M.; and Balduzzi, D. 2015. Domain generalization for object recognition with multi-task autoencoders. In ICCV.

[6]: [Bousmalis et al. 2016] Bousmalis, K.; Trigeorgis, G.; Silberman, N.; Krishnan, D.; and Erhan, D. 2016. Domain separation networks. In NIPS.

[7]: [Khosla et al. 2012] Khosla, A.; Zhou, T.; Malisiewicz, T.; Efros, A. A.; and Torralba, A. 2012. Undoing the damage of dataset bias. In ECCV.

SuperGLUE

2020-11-20T22:19:34Z

A4moayye: /* Results */

== Presented by ==
Shikhar Sakhuja

== Introduction ==
Natural Language Processing (NLP) has seen immense improvements over the past two years. The improvements offered by RNN-based model such as ELMo [2], and Transformer [1] based models such as OpenAI GPT [3], BERT[4], etc. have revolutionized the field. These models render GLUE [5], the standard benchmark for NLP tasks, ineffective. The GLUE benchmark was released over a year ago and assessed NLP models using a single-number metric that summarized performance over some diverse tasks. However, the transformer-based models outperform the non-expert humans in several tasks. With transformer-based models achieving near-perfect scores on almost all tasks in GLUE and outperforming humans in some, there is a need for a new benchmark that involves harder and even more diverse language tasks. The authors release SuperGLUE as a new benchmark that has a more rigorous set of language understanding tasks.

== Related Work ==
There have been several benchmarks attempting to standardize the field of language understanding tasks. SentEval [6] evaluated fixed-size sentence embeddings for tasks. DecaNLP [7] converts tasks into a general question-answering format. GLUE offers a much more flexible and extensible benchmark since it imposes no restrictions on model architectures or parameter sharing.

GLUE has been the gold standard for language understanding tests since it’s release. In fact, the benchmark has promoted growth in language models with all the transformer-based models started with attempting to achieve high scores on GLUE. Original GPT and BERT models scored 72.8 and 80.2 on GLUE. The latest GPT and BERT models, however, far outperform these benchmarks and strike a need for a more robust and difficult benchmark.

== Motivation ==
Transformer based NLP models allow NLP models to train using transfer learning which was previously only seen in Computer Vision tasks and was notoriously difficult for language because of the discrete nature of words. Transfer Learning in NLP allows models to be trained over terabytes of language data in a self-supervised fashion. These models can then be finetuned for downstream tasks such as sentiment classification, fake news detection, etc. The fine-tuned models beat many of the human labellers who weren’t experts in the domain. Thus, creating a need for a newer, more robust baseline that can stay relevant with the rapid improvements in the field of NLP.

[[File:loser glue.png]]

Figure 1: Transformer-based models outperforming humans in GLUE tasks.

== Design Process ==
There are 6 requirements/specifications for tasks to comprise the SuperGLUE benchmark.

#'''Task substance:''' Tasks should test a system's reasoning and understanding of English text.
#'''Task difficulty:''' Tasks should be solvable by those who graduated from an English postsecondary institution.
#'''Evaluability:''' Tasks are required to have an automated performance metric that aligns to human judgements of the output quality.
#'''Public data:''' Tasks need to have existing public data for training with a preference for an additional private test set.
#'''Task format:''' Preference for tasks with simpler input and output formats to steer users of the benchmark away from tasks specific architectures.
#'''License:''' Task data must be under a license that allows the redistribution and use for research.

== SuperGLUE Tasks ==

SuperGLUE has 8 language understanding tasks. They test a model’s understanding of texts in English. The tasks are built to be equivalent to the capabilities of most college-educated English speakers and are beyond the capabilities of most state-of-the-art systems today.

'''BoolQ''' (Boolean Questions [9]): QA task consisting of short passage and related questions to the passage as either a yes or a no answer.

'''CB''' (CommitmentBank [10]): Corpus of text where sentences have embedded clauses and sentences are written with the goal of keeping the clause accurate.

'''COPA''' (Choice of plausible Alternatives [11]): Reasoning tasks in which given a sentence the system must be able to choose the cause or effect of the sentence from two potential choices.

'''MultiRC''' (Multi-Sentence Reading Comprehension [12]): QA task in which given a passage and potential answers, the model should label the answers as true or false.

'''ReCoRD''' (Reading Comprehension with Commonsense Reasoning Dataset [13]): A multiple-choice, question answering task, where given a passage with a masked entity, the model should be able to predict the masked out entity from the choices.

'''RTE''' (Recognizing Textual Entailment [14]): Classifying whether a text that can be plausibly inferred from a given passage.

'''WiC''' (Word in Context [15]): Identifying whether a polysemous word used in multiple sentences is being used with the same sense across sentences or not.

'''WSC''' (Winograd Schema Challenge, [16]): A conference resolution task where sentences include pronouns and noun phrases from the sentence. The goal is to identify the correct reference to a noun phrase corresponding to the pronoun.

== Model Analysis ==
SuperGLUE includes two tasks for analyzing linguistic knowledge and gender bias in models. To analyze linguistic and world knowledge, submissions to SuperBLUE are required to include predictions of sentence pair relation (entailment, not_entailment) on the resulting set for RTE task. As for gender bias, SuperGLUE includes a diagnostic dataset Winogender, which measures gender bias in co-reference resolution systems. A poor bias score indicates gender bias, however, a good score does not necessarily mean a model is unbiased. This is one limitation of the dataset.

== Results ==

Table 1 offers a summary of the results from SuperGLUE across different models. CBOW baselines are generally close to roughly chance performance. BERT, on the other hand, increased the SuperGLUE score by 25 points and had the highest improvement on most tasks, especially MultiRCC, ReCoRD, and RTE. WSC is trickier for BERT, potentially owing to the small dataset size.

BERT++[8] increases BERT’s performance even further. However, achieving the goal of the benchmark, the best model/score still lags behind compared to human performance. The human results for WiC, MltiRC, RTE, and ReCoRD were already available on [15], [12], [17], and [13] respectively. However, for the remaining tasks, the authors employed crowdworkers to reannotate a sample of each test set according to the methods used in [17]. The large gaps should be relatively tricky for models to close in on. The biggest margin is for WSC with 35 points and CV, RTE, BoolQ, WiC all have 10 point margins.

[[File: 800px-SuperGLUE result.png]]

Table 1: Baseline performance on SuperGLUE tasks.

== Conclusion ==
SuperGLUE fills the gap that GLUE has created owing to its inability to keep up with the SOTA in NLP. The new language tasks that the benchmark offers are built to be more robust and difficult to solve for NLP models. With the difference in model accuracy being around 10-35 points across all tasks, SuperGLUE is definitely going to be around for some time before the models catch up to it, as well. Overall, this is a significant contribution to improve general-purpose natural language understanding.

== Critique ==
This is quite a fascinating read where the authors of the gold-standard benchmark have essentially conceded to the progress in NLP. Bowman’s team resorting to creating a new benchmark altogether to keep up with the rapid pace of increase in NLP makes me wonder if these benchmarks are inherently flawed. Applying the idea of Wittgenstein’s Ruler, are we measuring the performance of models using the benchmark, or the quality of benchmarks using the models?

I’m curious how long SuperGLUE would stay relevant owing to advances in NLP. GPT-3, released in June 2020, has outperformed GPT-2 and BERT by a huge margin, given the 100x increase in parameters (175B Parameters over ~600GB for GPT-3, compared to 1.5B parameters over 40GB for GPT-2). In October 2020, a new deep learning technique (Pattern Exploiting Training) managed to train a Transformer NLP model with 223M parameters (roughly 0.01% parameters of GPT-3) and outperformed GPT-3 by 3 points on SuperGLUE. With the field improving so rapidly, I think superGLUE is nothing but a bandaid for the benchmarking tasks that will turn obsolete in no time.

== References ==
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

[2] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1202. URL https://www.aclweb.org/anthology/N18-1202

[3] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018. Unpublished ms. available through a link at https://blog.openai.com/language-unsupervised/.

[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https: //arxiv.org/abs/1810.04805.

[5] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=rJ4km2R5t7.

[6] Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the 11th Language Resources and Evaluation Conference. European Language Resource Association, 2018. URL https://www.aclweb.org/anthology/L18-1269.

[7] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information processing Systems (NeurIPS). Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf.

[8] Jason Phang, Thibault Févry, and Samuel R Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint 1811.01088, 2018. URL https://arxiv.org/abs/1811.01088.

[9] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936,2019a.

[10] Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank: Investigating projection in naturally occurring discourse. 2019. To appear in Proceedings of Sinn und Bedeutung 23. Data can be found at https://github.com/mcdm/CommitmentBank/.

[11] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.

[12] Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language technologies (NAACL-HLT). Association for Computational Linguistics, 2018. URL https://www.aclweb.org/anthology/papers/N/N18/N18-1023/.

[13] Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint 1810.12885, 2018.

[14] Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment. Springer, 2006. URL https://link.springer.com/chapter/10.1007/11736790_9.

[15] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1808.09121.

[16] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012. URL http://dl.acm.org/citation.cfm?id=3031843.3031909.

[17] Nikita Nangia and Samuel R. Bowman. Human vs. Muppet: A conservative estimate of human performance on the GLUE benchmark. In Proceedings of the Association of Computational Linguistics (ACL). Association for Computational Linguistics, 2019. URL https://woollysocks.github.io/assets/GLUE_Human_Baseline.pdf.

SuperGLUE

2020-11-20T22:18:57Z

A4moayye: /* References */

== Presented by ==
Shikhar Sakhuja

== Introduction ==
Natural Language Processing (NLP) has seen immense improvements over the past two years. The improvements offered by RNN-based model such as ELMo [2], and Transformer [1] based models such as OpenAI GPT [3], BERT[4], etc. have revolutionized the field. These models render GLUE [5], the standard benchmark for NLP tasks, ineffective. The GLUE benchmark was released over a year ago and assessed NLP models using a single-number metric that summarized performance over some diverse tasks. However, the transformer-based models outperform the non-expert humans in several tasks. With transformer-based models achieving near-perfect scores on almost all tasks in GLUE and outperforming humans in some, there is a need for a new benchmark that involves harder and even more diverse language tasks. The authors release SuperGLUE as a new benchmark that has a more rigorous set of language understanding tasks.

== Related Work ==
There have been several benchmarks attempting to standardize the field of language understanding tasks. SentEval [6] evaluated fixed-size sentence embeddings for tasks. DecaNLP [7] converts tasks into a general question-answering format. GLUE offers a much more flexible and extensible benchmark since it imposes no restrictions on model architectures or parameter sharing.

GLUE has been the gold standard for language understanding tests since it’s release. In fact, the benchmark has promoted growth in language models with all the transformer-based models started with attempting to achieve high scores on GLUE. Original GPT and BERT models scored 72.8 and 80.2 on GLUE. The latest GPT and BERT models, however, far outperform these benchmarks and strike a need for a more robust and difficult benchmark.

== Motivation ==
Transformer based NLP models allow NLP models to train using transfer learning which was previously only seen in Computer Vision tasks and was notoriously difficult for language because of the discrete nature of words. Transfer Learning in NLP allows models to be trained over terabytes of language data in a self-supervised fashion. These models can then be finetuned for downstream tasks such as sentiment classification, fake news detection, etc. The fine-tuned models beat many of the human labellers who weren’t experts in the domain. Thus, creating a need for a newer, more robust baseline that can stay relevant with the rapid improvements in the field of NLP.

[[File:loser glue.png]]

Figure 1: Transformer-based models outperforming humans in GLUE tasks.

== Design Process ==
There are 6 requirements/specifications for tasks to comprise the SuperGLUE benchmark.

#'''Task substance:''' Tasks should test a system's reasoning and understanding of English text.
#'''Task difficulty:''' Tasks should be solvable by those who graduated from an English postsecondary institution.
#'''Evaluability:''' Tasks are required to have an automated performance metric that aligns to human judgements of the output quality.
#'''Public data:''' Tasks need to have existing public data for training with a preference for an additional private test set.
#'''Task format:''' Preference for tasks with simpler input and output formats to steer users of the benchmark away from tasks specific architectures.
#'''License:''' Task data must be under a license that allows the redistribution and use for research.

== SuperGLUE Tasks ==

SuperGLUE has 8 language understanding tasks. They test a model’s understanding of texts in English. The tasks are built to be equivalent to the capabilities of most college-educated English speakers and are beyond the capabilities of most state-of-the-art systems today.

'''BoolQ''' (Boolean Questions [9]): QA task consisting of short passage and related questions to the passage as either a yes or a no answer.

'''CB''' (CommitmentBank [10]): Corpus of text where sentences have embedded clauses and sentences are written with the goal of keeping the clause accurate.

'''COPA''' (Choice of plausible Alternatives [11]): Reasoning tasks in which given a sentence the system must be able to choose the cause or effect of the sentence from two potential choices.

'''MultiRC''' (Multi-Sentence Reading Comprehension [12]): QA task in which given a passage and potential answers, the model should label the answers as true or false.

'''ReCoRD''' (Reading Comprehension with Commonsense Reasoning Dataset [13]): A multiple-choice, question answering task, where given a passage with a masked entity, the model should be able to predict the masked out entity from the choices.

'''RTE''' (Recognizing Textual Entailment [14]): Classifying whether a text that can be plausibly inferred from a given passage.

'''WiC''' (Word in Context [15]): Identifying whether a polysemous word used in multiple sentences is being used with the same sense across sentences or not.

'''WSC''' (Winograd Schema Challenge, [16]): A conference resolution task where sentences include pronouns and noun phrases from the sentence. The goal is to identify the correct reference to a noun phrase corresponding to the pronoun.

== Model Analysis ==
SuperGLUE includes two tasks for analyzing linguistic knowledge and gender bias in models. To analyze linguistic and world knowledge, submissions to SuperBLUE are required to include predictions of sentence pair relation (entailment, not_entailment) on the resulting set for RTE task. As for gender bias, SuperGLUE includes a diagnostic dataset Winogender, which measures gender bias in co-reference resolution systems. A poor bias score indicates gender bias, however, a good score does not necessarily mean a model is unbiased. This is one limitation of the dataset.

== Results ==

Table 1 offers a summary of the results from SuperGLUE across different models. CBOW baselines are generally close to roughly chance performance. BERT, on the other hand, increased the SuperGLUE score by 25 points and had the highest improvement on most tasks, especially MultiRCC, ReCoRD, and RTE. WSC is trickier for BERT, potentially owing to the small dataset size.

BERT++[8] increases BERT’s performance even further. However, achieving the goal of the benchmark, the best model/score still lags behind compared to human performance. The large gaps should be relatively tricky for models to close in on. The biggest margin is for WSC with 35 points and CV, RTE, BoolQ, WiC all have 10 point margins.

[[File: 800px-SuperGLUE result.png]]

Table 1: Baseline performance on SuperGLUE tasks.

== Conclusion ==
SuperGLUE fills the gap that GLUE has created owing to its inability to keep up with the SOTA in NLP. The new language tasks that the benchmark offers are built to be more robust and difficult to solve for NLP models. With the difference in model accuracy being around 10-35 points across all tasks, SuperGLUE is definitely going to be around for some time before the models catch up to it, as well. Overall, this is a significant contribution to improve general-purpose natural language understanding.

== Critique ==
This is quite a fascinating read where the authors of the gold-standard benchmark have essentially conceded to the progress in NLP. Bowman’s team resorting to creating a new benchmark altogether to keep up with the rapid pace of increase in NLP makes me wonder if these benchmarks are inherently flawed. Applying the idea of Wittgenstein’s Ruler, are we measuring the performance of models using the benchmark, or the quality of benchmarks using the models?

I’m curious how long SuperGLUE would stay relevant owing to advances in NLP. GPT-3, released in June 2020, has outperformed GPT-2 and BERT by a huge margin, given the 100x increase in parameters (175B Parameters over ~600GB for GPT-3, compared to 1.5B parameters over 40GB for GPT-2). In October 2020, a new deep learning technique (Pattern Exploiting Training) managed to train a Transformer NLP model with 223M parameters (roughly 0.01% parameters of GPT-3) and outperformed GPT-3 by 3 points on SuperGLUE. With the field improving so rapidly, I think superGLUE is nothing but a bandaid for the benchmarking tasks that will turn obsolete in no time.

== References ==
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

[2] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1202. URL https://www.aclweb.org/anthology/N18-1202

[3] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018. Unpublished ms. available through a link at https://blog.openai.com/language-unsupervised/.

[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https: //arxiv.org/abs/1810.04805.

[5] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=rJ4km2R5t7.

[6] Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the 11th Language Resources and Evaluation Conference. European Language Resource Association, 2018. URL https://www.aclweb.org/anthology/L18-1269.

[7] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information processing Systems (NeurIPS). Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf.

[8] Jason Phang, Thibault Févry, and Samuel R Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint 1811.01088, 2018. URL https://arxiv.org/abs/1811.01088.

[9] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936,2019a.

[10] Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank: Investigating projection in naturally occurring discourse. 2019. To appear in Proceedings of Sinn und Bedeutung 23. Data can be found at https://github.com/mcdm/CommitmentBank/.

[11] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.

[12] Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language technologies (NAACL-HLT). Association for Computational Linguistics, 2018. URL https://www.aclweb.org/anthology/papers/N/N18/N18-1023/.

[13] Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint 1810.12885, 2018.

[14] Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment. Springer, 2006. URL https://link.springer.com/chapter/10.1007/11736790_9.

[15] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1808.09121.

[16] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012. URL http://dl.acm.org/citation.cfm?id=3031843.3031909.

[17] Nikita Nangia and Samuel R. Bowman. Human vs. Muppet: A conservative estimate of human performance on the GLUE benchmark. In Proceedings of the Association of Computational Linguistics (ACL). Association for Computational Linguistics, 2019. URL https://woollysocks.github.io/assets/GLUE_Human_Baseline.pdf.

Breaking Certified Defenses: Semantic Adversarial Examples With Spoofed Robustness Certificates

2020-11-15T12:55:56Z

A4moayye: /* Background */

== Presented By ==
Gaurav Sikri

== Background ==

Adversarial examples are inputs to machine learning or deep neural network models that an attacker intentionally designs to deceive the model or to cause the model to make a wrong prediction. This is done by adding a little noise to the original image or perturbing an original image and creating an image that is not identified by the network and therefore, the model misclassifies the new image. The following image describes an adversarial attack where a model is deceived by an attacker by adding a small noise to an input image and as a result, the prediction of the model changes.

[[File:adversarial_example.png|500px|center|Image: 500 pixels]]
<div align="center">'''Figure 1:''' Adversarial Example </div>

The impacts of adversarial attacks can be life-threatening in the real world. Consider the case of driverless cars where the model installed in a car is trying to read a STOP sign on the road. However, if the STOP sign is replaced by an adversarial image of the original image, and if that new image is able to fool the model to not make a decision to stop, it can lead to an accident. Hence it becomes really important to design the classifiers such that these classifiers are immune to such adversarial attacks.

While training a deep network, the network is trained on a set of augmented images along with the original images. For any given image, there are multiple augmented images created and passed to the network to ensure that a model is able to learn from the augmented images as well. During the validation phase, after labeling an image, the defenses check whether there exists an image of a different label within a region of a certain unit radius of the input. If the classifier assigns all images within the specified region ball the same class label, then a certificate is issued. This certificate ensures that the model is protected from adversarial attacks and is called Certified Defense. The image below shows a certified region (in red)

[[File:certified_defense.png|500px|center|Image: 500 pixels]]
<div align="center">'''Figure 2:''' Certified Defense Illustration </div>

== Introduction ==
Conventional deep learning models are generally highly sensitive to adversarial perturbations (Szegedy et al., 2013) in a way that natural-looking but minutely augmented images have been able to manipulate those models by causing misclassifications. While in the last few years, several defenses have been build that protects neural networks against such attacks (Madry et al., 2017; Shafahi et al., 2019), but the defenses based on heuristics and tricks are often easily breakable (Athalye et al. 2018). This has motivated a lot of researchers to work on certifiably secure networks — classifiers that produce a label for an image, and at the same time guarantees that the input is not adversarially manipulated. Most of the certified defenses created so far focus on deflecting <math>l_\text{p}</math>-bounded attacks where <math>p</math> = 2 or infinity.

In this paper, the authors have demonstrated that a system that relies on certificates as a measure of label security can be exploited. The whole idea of the paper is to show that even though the system has a certified defense mechanism, it does not guarantee security against adversarial attacks. This is done by presenting a new class of adversarial examples that target not only the classifier output label but also the certificate. The first step is to add adversarial perturbations to images that are large in the <math>l_\text{p}</math>-norm (larger than the radius of the certificate region of the original image), and produce attack images that are outside the certificate boundary of the original image certificate and has images of the same (wrong) label. The result is a 'spoofed' certificate with a seemingly strong security guarantee despite being adversarially manipulated.

The following three conditions should be met while creating adversarial examples:

'''1. Imperceptibility: the adversarial image looks like the original example.

'''2. Misclassification: the certified classifier assigns an incorrect label to the adversarial example.

'''3. Strongly certified: the certified classifier provides a strong radius certificate for the adversarial example.

The main focus of the paper is to attack the certificate of the model. The authors argue that the model can be attacked, no matter how strong the certificate of the model is.

== Approach ==
The approach used by the authors in this paper is 'Shadow Attack', which is a generalization of the well known PGD attack. The fundamental idea of the PGD attack is the same where a bunch of adversarial images is created in order to fool the network to make a wrong prediction. PGD attack solves the following optimization problem where <math>L</math> is the classification loss and the constraint corresponds to the minimal change done to the input image.

\begin{align}
max_{\delta }L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}
\end{align}

\begin{align}
s.t. \left \|\delta \right \|_{p} \leq \epsilon
\end{align}

Shadow attack on the other hand targets the certificate of the defenses by creating a new 'spoofed' certificate outside the certificate region of the input image. Shadow attack solves the following optimization problem where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.

\begin{align}
max_{\delta} L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}
\end{align}

In equation \eqref{eq:op1}, <math>C</math> in the above equation corresponds to the color regularizer which makes sure that minimal changes are made to the color of the input image. <math>TV</math> corresponds to the Total Variation or smoothness parameter which makes sure that the smoothness of the newly created image is maintained. <math>Dissim</math> corresponds to the similarity parameter which makes sure that all the color channels (RGB) are changed equally.

The perturbations created in the original images are -

'''1. small

'''2. smooth

'''3. without dramatic color changes

There are two ways to ensure that this dissimilarity will not happen or will be very low and the authors have shown that both of these methods are effective.
* 1-channel attack: This strictly enforces <math>\delta_{R,i} \approx \delta_{G,i} \approx \delta_{B,i} \forall i </math> i.e. for each pixel, the perturbations of all channels are equal and there will be <math> \delta_{ W \times H} </math>, where the size of the image is <math>3 \times W \times H</math> as the preturbation. In this case, <math>Dissim(\delta)=0 </math>.

* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not equal and it uses <math> \delta_{3 \times W \times H} </math> with the <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.

== Ablation Study of the Attack parameters==
In order to determine the required number of SGD steps, the effect of <math> \lambda_s</math>, and the importance of <math> \lambda_s</math> on the each losses in the cost function, the authors have tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figure 4, 5, and 6, we can see that the <math>L(\delta)</math> (classification loss), <math>TV(\delta)</math> (Total Variation loss), <math>C(\delta)</math> (color regularizer) will converge to zero with 10 SGD steps. Note that since only 1-channel attack was used in this part of the experiment the <math>dissim(\delta)</math>was indeed zero.
In figure 6 and 7, we can see the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss respectively.

[[File:Ablation.png|500px|center|Image: 500 pixels]]

== Experiments ==
The authors used two experiments to prove that their approach to attack a certified model was actually able to break those defenses. The datasets used for both of these experiments were CIFAR10 and ImageNet dataset.

=== Attack on Randomized Smoothing ===
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.

The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing -

[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]

<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images
and also natural images (larger radii means a stronger/more confident certificate) </div>

The third and the fifth column correspond to the mean radius of the certified region of the original image and the mean radius of the spoof certificate of the perturbed images, respectively. It was observed that the mean radius of the certificate of adversarial images was greater than the mean radius of the original image certificate. This proves that the 'Shadow Attack' approach was successful in creating spoof certificates of greater radius and with the wrong label. This also proves that the approach used in the paper was successful in breaking the certified defenses.

=== Attack on CROWN-IBP ===
Crown IBP is an adversarial defense against <math>l_\text{inf}</math>-norm bounded attacks. The same approach was applied for the CROWN-IBP defense and the table below shows the results.

[[File:crown_ibp.png|500px|center|Image: 500 pixels]]
<div align="center">'''Table 2 :''' “Robust error” for natural images, and “attack error” for Shadow Attack images using the
CIFAR-10 dataset, and CROWN-IBP models. Smaller is better.) </div>

The above table shows the robustness errors in the case of CROWN-IBP method and the attack images. It is seen that the errors in the case of the attack were less than the equivalent errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\text{inf}</math>-norm certified defenses as well.

== Conclusion ==
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.
== Critiques==

It is noticeable in this paper that using the mathematical formulation of the defenses and certifications is considered a weak method, where as the constraint is imposed by <math> l_{p} </math> as assumed in equation \eqref{eq:op}. The top models can not achieve certifications beyond <math> \epsilon = 0.3 </math> disturbance in <math> l_{2} </math> norm, while disturbances <math> \epsilon = 4 </math> added to the target input are barely noticeable by human eyes, and <math> \epsilon = 100 </math> , when applied to the original image are still easily classified by humans as belonging to the same class. As discussed by many authors, the perception of multi-dimensional space by human eyes goes beyond what the <math> l_{p} </math> norm is capable of capturing and synthesizing. It is yet to be proposed more comprehensive metrics and algorithms capable of capturing the correlation between pixels of an image or input data which can better translate to optimization algorithms how humans distinguish features of an input image. Such a metric would allow the optimization algorithms to have better intuition on the subtle variations introduced by adversaries in the input data.

== References ==
Christian Szegedy,Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.

Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Breaking Certified Defenses: Semantic Adversarial Examples With Spoofed Robustness Certificates

2020-11-15T12:52:56Z

A4moayye: /* Background */

== Presented By ==
Gaurav Sikri

== Background ==

Adversarial examples are inputs to machine learning or deep neural network models that an attacker intentionally designs to deceive the model or to cause the model to make a wrong prediction. This is done by adding a little noise to the original image or perturbing an original image and creating an image that is not identified by the network and the model misclassifies the new image. The following image describes an adversarial attack where a model is deceived by an attacker by adding small noise to an input image and as a result, the prediction of the model changed.

[[File:adversarial_example.png|500px|center|Image: 500 pixels]]
<div align="center">'''Figure 1:''' Adversarial Example </div>

The impacts of adversarial attacks can be life-threatening in the real world. Consider the case of driverless cars where the model installed in a car is trying to read a STOP sign on the road. However, if the STOP sign is replaced by an adversarial image of the original image, and if that new image is able to fool the model to not make a decision to stop, it can lead to an accident. Hence it becomes really important to design the classifiers such that these classifiers are immune to such adversarial attacks.

While training a deep network, the network is trained on a set of augmented images along with the original images. For any given image, there are multiple augmented images created and passed to the network to ensure that a model is able to learn from the augmented images as well. During the validation phase, after labeling an image, the defenses check whether there exists an image of a different label within a region of a certain unit radius of the input. If the classifier assigns all images within the specified region ball the same class label, then a certificate is issued. This certificate ensures that the model is protected from adversarial attacks and is called Certified Defense. The image below shows a certified region (in red)

[[File:certified_defense.png|500px|center|Image: 500 pixels]]
<div align="center">'''Figure 2:''' Certified Defense Illustration </div>

== Introduction ==
Conventional deep learning models are generally highly sensitive to adversarial perturbations (Szegedy et al., 2013) in a way that natural-looking but minutely augmented images have been able to manipulate those models by causing misclassifications. While in the last few years, several defenses have been build that protects neural networks against such attacks (Madry et al., 2017; Shafahi et al., 2019), but the defenses based on heuristics and tricks are often easily breakable (Athalye et al. 2018). This has motivated a lot of researchers to work on certifiably secure networks — classifiers that produce a label for an image, and at the same time guarantees that the input is not adversarially manipulated. Most of the certified defenses created so far focus on deflecting <math>l_\text{p}</math>-bounded attacks where <math>p</math> = 2 or infinity.

In this paper, the authors have demonstrated that a system that relies on certificates as a measure of label security can be exploited. The whole idea of the paper is to show that even though the system has a certified defense mechanism, it does not guarantee security against adversarial attacks. This is done by presenting a new class of adversarial examples that target not only the classifier output label but also the certificate. The first step is to add adversarial perturbations to images that are large in the <math>l_\text{p}</math>-norm (larger than the radius of the certificate region of the original image), and produce attack images that are outside the certificate boundary of the original image certificate and has images of the same (wrong) label. The result is a 'spoofed' certificate with a seemingly strong security guarantee despite being adversarially manipulated.

The following three conditions should be met while creating adversarial examples:

'''1. Imperceptibility: the adversarial image looks like the original example.

'''2. Misclassification: the certified classifier assigns an incorrect label to the adversarial example.

'''3. Strongly certified: the certified classifier provides a strong radius certificate for the adversarial example.

The main focus of the paper is to attack the certificate of the model. The authors argue that the model can be attacked, no matter how strong the certificate of the model is.

== Approach ==
The approach used by the authors in this paper is 'Shadow Attack', which is a generalization of the well known PGD attack. The fundamental idea of the PGD attack is the same where a bunch of adversarial images is created in order to fool the network to make a wrong prediction. PGD attack solves the following optimization problem where <math>L</math> is the classification loss and the constraint corresponds to the minimal change done to the input image.

\begin{align}
max_{\delta }L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}
\end{align}

\begin{align}
s.t. \left \|\delta \right \|_{p} \leq \epsilon
\end{align}

Shadow attack on the other hand targets the certificate of the defenses by creating a new 'spoofed' certificate outside the certificate region of the input image. Shadow attack solves the following optimization problem where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.

\begin{align}
max_{\delta} L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}
\end{align}

In equation \eqref{eq:op1}, <math>C</math> in the above equation corresponds to the color regularizer which makes sure that minimal changes are made to the color of the input image. <math>TV</math> corresponds to the Total Variation or smoothness parameter which makes sure that the smoothness of the newly created image is maintained. <math>Dissim</math> corresponds to the similarity parameter which makes sure that all the color channels (RGB) are changed equally.

The perturbations created in the original images are -

'''1. small

'''2. smooth

'''3. without dramatic color changes

There are two ways to ensure that this dissimilarity will not happen or will be very low and the authors have shown that both of these methods are effective.
* 1-channel attack: This strictly enforces <math>\delta_{R,i} \approx \delta_{G,i} \approx \delta_{B,i} \forall i </math> i.e. for each pixel, the perturbations of all channels are equal and there will be <math> \delta_{ W \times H} </math>, where the size of the image is <math>3 \times W \times H</math> as the preturbation. In this case, <math>Dissim(\delta)=0 </math>.

* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not equal and it uses <math> \delta_{3 \times W \times H} </math> with the <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.

== Ablation Study of the Attack parameters==
In order to determine the required number of SGD steps, the effect of <math> \lambda_s</math>, and the importance of <math> \lambda_s</math> on the each losses in the cost function, the authors have tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figure 4, 5, and 6, we can see that the <math>L(\delta)</math> (classification loss), <math>TV(\delta)</math> (Total Variation loss), <math>C(\delta)</math> (color regularizer) will converge to zero with 10 SGD steps. Note that since only 1-channel attack was used in this part of the experiment the <math>dissim(\delta)</math>was indeed zero.
In figure 6 and 7, we can see the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss respectively.

[[File:Ablation.png|500px|center|Image: 500 pixels]]

== Experiments ==
The authors used two experiments to prove that their approach to attack a certified model was actually able to break those defenses. The datasets used for both of these experiments were CIFAR10 and ImageNet dataset.

=== Attack on Randomized Smoothing ===
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.

The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing -

[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]

<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images
and also natural images (larger radii means a stronger/more confident certificate) </div>

The third and the fifth column correspond to the mean radius of the certified region of the original image and the mean radius of the spoof certificate of the perturbed images, respectively. It was observed that the mean radius of the certificate of adversarial images was greater than the mean radius of the original image certificate. This proves that the 'Shadow Attack' approach was successful in creating spoof certificates of greater radius and with the wrong label. This also proves that the approach used in the paper was successful in breaking the certified defenses.

=== Attack on CROWN-IBP ===
Crown IBP is an adversarial defense against <math>l_\text{inf}</math>-norm bounded attacks. The same approach was applied for the CROWN-IBP defense and the table below shows the results.

[[File:crown_ibp.png|500px|center|Image: 500 pixels]]
<div align="center">'''Table 2 :''' “Robust error” for natural images, and “attack error” for Shadow Attack images using the
CIFAR-10 dataset, and CROWN-IBP models. Smaller is better.) </div>

The above table shows the robustness errors in the case of CROWN-IBP method and the attack images. It is seen that the errors in the case of the attack were less than the equivalent errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\text{inf}</math>-norm certified defenses as well.

== Conclusion ==
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.
== Critiques==

It is noticeable in this paper that using the mathematical formulation of the defenses and certifications is considered a weak method, where as the constraint is imposed by <math> l_{p} </math> as assumed in equation \eqref{eq:op}. The top models can not achieve certifications beyond <math> \epsilon = 0.3 </math> disturbance in <math> l_{2} </math> norm, while disturbances <math> \epsilon = 4 </math> added to the target input are barely noticeable by human eyes, and <math> \epsilon = 100 </math> , when applied to the original image are still easily classified by humans as belonging to the same class. As discussed by many authors, the perception of multi-dimensional space by human eyes goes beyond what the <math> l_{p} </math> norm is capable of capturing and synthesizing. It is yet to be proposed more comprehensive metrics and algorithms capable of capturing the correlation between pixels of an image or input data which can better translate to optimization algorithms how humans distinguish features of an input image. Such a metric would allow the optimization algorithms to have better intuition on the subtle variations introduced by adversaries in the input data.

== References ==
Christian Szegedy,Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.

Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Breaking Certified Defenses: Semantic Adversarial Examples With Spoofed Robustness Certificates

2020-11-15T12:49:19Z

A4moayye: /* Experiments */

== Presented By ==
Gaurav Sikri

== Background ==

Adversarial examples are inputs to machine learning or deep neural network models that an attacker intentionally designs to deceive the model or to cause the model to make a wrong prediction. This is done by adding a little noise to the original image or perturbing an original image and creating an image that is not identified by the network and the model misclassifies the new image. The following image describes an adversarial attack where a model is deceived by an attacker by adding small noise to an input image and as a result, the prediction of the model changed.

[[File:adversarial_example.png|500px|center|Image: 500 pixels]]
<div align="center">'''Figure 1:''' Adversarial Example </div>

The impacts of adversarial attacks can be life-threatening in the real world. Consider the case of driverless cars where the model installed in a car is trying to read a STOP sign on the road. However, if the STOP sign is replaced by an adversarial image of the original image, and if that new image is able to fool the model to not make a decision to stop, it can lead to an accident. Hence it becomes really important to design the classifiers such that these classifiers are immune to such adversarial attacks.

While training a deep network, the network is trained on a set of augmented images as well along with the original images. For any given image, there are multiple augmented images created and passed to the network to ensure that a model is able to learn from the augmented images as well. During the validation phase, after labeling an image, the defenses check whether there exists an image of a different label within a region of a certain unit radius of the input. If the classifier assigns all images within the specified region ball the same class label, then a certificate is issued. This certificate ensures that the model is protected from adversarial attacks and is called Certified Defense. The image below shows a certified region (in red)

[[File:certified_defense.png|500px|center|Image: 500 pixels]]
<div align="center">'''Figure 2:''' Certified Defense Illustration </div>

== Introduction ==
Conventional deep learning models are generally highly sensitive to adversarial perturbations (Szegedy et al., 2013) in a way that natural-looking but minutely augmented images have been able to manipulate those models by causing misclassifications. While in the last few years, several defenses have been build that protects neural networks against such attacks (Madry et al., 2017; Shafahi et al., 2019), but the defenses based on heuristics and tricks are often easily breakable (Athalye et al. 2018). This has motivated a lot of researchers to work on certifiably secure networks — classifiers that produce a label for an image, and at the same time guarantees that the input is not adversarially manipulated. Most of the certified defenses created so far focus on deflecting <math>l_\text{p}</math>-bounded attacks where <math>p</math> = 2 or infinity.

In this paper, the authors have demonstrated that a system that relies on certificates as a measure of label security can be exploited. The whole idea of the paper is to show that even though the system has a certified defense mechanism, it does not guarantee security against adversarial attacks. This is done by presenting a new class of adversarial examples that target not only the classifier output label but also the certificate. The first step is to add adversarial perturbations to images that are large in the <math>l_\text{p}</math>-norm (larger than the radius of the certificate region of the original image), and produce attack images that are outside the certificate boundary of the original image certificate and has images of the same (wrong) label. The result is a 'spoofed' certificate with a seemingly strong security guarantee despite being adversarially manipulated.

The following three conditions should be met while creating adversarial examples:

'''1. Imperceptibility: the adversarial image looks like the original example.

'''2. Misclassification: the certified classifier assigns an incorrect label to the adversarial example.

'''3. Strongly certified: the certified classifier provides a strong radius certificate for the adversarial example.

The main focus of the paper is to attack the certificate of the model. The authors argue that the model can be attacked, no matter how strong the certificate of the model is.

== Approach ==
The approach used by the authors in this paper is 'Shadow Attack', which is a generalization of the well known PGD attack. The fundamental idea of the PGD attack is the same where a bunch of adversarial images is created in order to fool the network to make a wrong prediction. PGD attack solves the following optimization problem where <math>L</math> is the classification loss and the constraint corresponds to the minimal change done to the input image.

\begin{align}
max_{\delta }L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}
\end{align}

\begin{align}
s.t. \left \|\delta \right \|_{p} \leq \epsilon
\end{align}

Shadow attack on the other hand targets the certificate of the defenses by creating a new 'spoofed' certificate outside the certificate region of the input image. Shadow attack solves the following optimization problem where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.

\begin{align}
max_{\delta} L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}
\end{align}

In equation \eqref{eq:op1}, <math>C</math> in the above equation corresponds to the color regularizer which makes sure that minimal changes are made to the color of the input image. <math>TV</math> corresponds to the Total Variation or smoothness parameter which makes sure that the smoothness of the newly created image is maintained. <math>Dissim</math> corresponds to the similarity parameter which makes sure that all the color channels (RGB) are changed equally.

The perturbations created in the original images are -

'''1. small

'''2. smooth

'''3. without dramatic color changes

There are two ways to ensure that this dissimilarity will not happen or will be very low and the authors have shown that both of these methods are effective.
* 1-channel attack: This strictly enforces <math>\delta_{R,i} \approx \delta_{G,i} \approx \delta_{B,i} \forall i </math> i.e. for each pixel, the perturbations of all channels are equal and there will be <math> \delta_{ W \times H} </math>, where the size of the image is <math>3 \times W \times H</math> as the preturbation. In this case, <math>Dissim(\delta)=0 </math>.

* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not equal and it uses <math> \delta_{3 \times W \times H} </math> with the <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.

== Ablation Study of the Attack parameters==
In order to determine the required number of SGD steps, the effect of <math> \lambda_s</math>, and the importance of <math> \lambda_s</math> on the each losses in the cost function, the authors have tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figure 4, 5, and 6, we can see that the <math>L(\delta)</math> (classification loss), <math>TV(\delta)</math> (Total Variation loss), <math>C(\delta)</math> (color regularizer) will converge to zero with 10 SGD steps. Note that since only 1-channel attack was used in this part of the experiment the <math>dissim(\delta)</math>was indeed zero.
In figure 6 and 7, we can see the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss respectively.

[[File:Ablation.png|500px|center|Image: 500 pixels]]

== Experiments ==
The authors used two experiments to prove that their approach to attack a certified model was actually able to break those defenses. The datasets used for both of these experiments were CIFAR10 and ImageNet dataset.

=== Attack on Randomized Smoothing ===
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.

The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing -

[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]

<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images
and also natural images (larger radii means a stronger/more confident certificate) </div>

The third and the fifth column correspond to the mean radius of the certified region of the original image and the mean radius of the spoof certificate of the perturbed images, respectively. It was observed that the mean radius of the certificate of adversarial images was greater than the mean radius of the original image certificate. This proves that the 'Shadow Attack' approach was successful in creating spoof certificates of greater radius and with the wrong label. This also proves that the approach used in the paper was successful in breaking the certified defenses.

=== Attack on CROWN-IBP ===
Crown IBP is an adversarial defense against <math>l_\text{inf}</math>-norm bounded attacks. The same approach was applied for the CROWN-IBP defense and the table below shows the results.

[[File:crown_ibp.png|500px|center|Image: 500 pixels]]
<div align="center">'''Table 2 :''' “Robust error” for natural images, and “attack error” for Shadow Attack images using the
CIFAR-10 dataset, and CROWN-IBP models. Smaller is better.) </div>

The above table shows the robustness errors in the case of CROWN-IBP method and the attack images. It is seen that the errors in the case of the attack were less than the equivalent errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\text{inf}</math>-norm certified defenses as well.

== Conclusion ==
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.
== Critiques==

It is noticeable in this paper that using the mathematical formulation of the defenses and certifications is considered a weak method, where as the constraint is imposed by <math> l_{p} </math> as assumed in equation \eqref{eq:op}. The top models can not achieve certifications beyond <math> \epsilon = 0.3 </math> disturbance in <math> l_{2} </math> norm, while disturbances <math> \epsilon = 4 </math> added to the target input are barely noticeable by human eyes, and <math> \epsilon = 100 </math> , when applied to the original image are still easily classified by humans as belonging to the same class. As discussed by many authors, the perception of multi-dimensional space by human eyes goes beyond what the <math> l_{p} </math> norm is capable of capturing and synthesizing. It is yet to be proposed more comprehensive metrics and algorithms capable of capturing the correlation between pixels of an image or input data which can better translate to optimization algorithms how humans distinguish features of an input image. Such a metric would allow the optimization algorithms to have better intuition on the subtle variations introduced by adversaries in the input data.

== References ==
Christian Szegedy,Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.

Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

The Curious Case of Degeneration

2020-11-11T18:12:06Z

A4moayye:

== Presented by ==
Donya Hamzeian
== Introduction ==
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results.
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]

As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements.

The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts

Some may raise this question that the problem with beam-search may be due to search error i.e. they are more probable phrases that beam search is unable to find, but the point is that natural language has lower per-token probability on average and people usually optimize against saying the obvious.

The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have high variance in likelihood leading to incoherency issues.
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.

==Language Model Decoding==
There are two types of generation tasks.

1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Because of this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.

2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks, and in fact, they are the focus of this paper.

The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.
====Nucleus Sampling====
This decoding strategy is indeed truncating the long tail of the probability distribution. In order to do that, first, we need to find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. Then set <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math> and rescale the probability distribution with <math>p'</math> and select the tokens from <math>P'</math>.
<math>
P'(x|x_{1:i-1}) = \begin{cases}
\frac{P(x|x_{1:i-1})}{p'}, & \mbox{if } x \in V^{(p)} \\
0 & \mbox{if } otherwise
\end{cases}

</math>

====Top-k Sampling====
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size k <math>V^{(k)} </math> which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling.

====Sampling with Temperature====
In this method, which was proposed in [1], the probability of tokens are calculated according to the equation below where 0<t<1 and <math>u_{1:|V|} </math> are logits. Recent studies have shown that lowering t improves the quality of the generated texts while it decreases diversity.

<math>
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}
</math>

==Likelihood Evaluation==
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.

====Perplexity====

This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts.

[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]

==Distributional Statistical Evaluation==
====Zipf Distribution Analysis====
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]

====Self BLEU====
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling.

[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]

==Conclusion==
In this paper, different decoding strategies were analyzed on open-ended generation tasks. They showed that likelihood maximization decoding causes degeneration where decoding strategies- which rely on truncating the probability distribution of tokens- especially Nucleus sampling can produce coherent and diverse texts close to human-generated texts.

== References ==
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.

[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018

The Curious Case of Degeneration

2020-11-11T17:58:47Z

A4moayye:

== Presented by ==
Donya Hamzeian
== Introduction ==
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results.
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]

As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements.

The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts

Some may raise this question that the problem with beam-search may be due to search error i.e. they are more probable phrases that beam search is unable to find, but the point is that natural language has lower per-token probability on average and people usually optimize against saying the obvious.

The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have high variance in likelihood leading to incoherency issues.
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.

==Language Model Decoding==
There are two types of generation tasks.

1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Because of this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.

2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks, and in fact, they are the focus of this paper.

The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.
====Nucleus Sampling====
This decoding strategy is indeed truncating the long tail of the probability distribution. In order to do that, first, we need to find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. Then set <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math> and rescale the probability distribution with <math>p'</math> and select the tokens from <math>P'</math>.
<math>
P'(x|x_{1:i-1}) = \begin{cases}
\frac{P(x|x_{1:i-1})}{p'}, & \mbox{if } x \in V^{(p)} \\
0 & \mbox{if } otherwise
\end{cases}

</math>

====Top-k Sampling====
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size k <math>V^{(k)} </math> which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling.

====Sampling with Temperature====
The probability of tokens will be calculated according to the equation below where 0<t<1 and <math>u_{1:|V|} </math> are logits. Recent studies have shown that lowering t improves the quality of the generated texts while it decreases diversity.

<math>
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}
</math>

==Likelihood Evaluation==
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.

====Perplexity====

This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts.

[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]

==Distributional Statistical Evaluation==
====Zipf Distribution Analysis====
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]

====Self BLEU====
The Self-BLEU score[1] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling.

[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]

==Conclusion==
In this paper, different decoding strategies were analyzed on open-ended generation tasks. They showed that likelihood maximization decoding causes degeneration where decoding strategies- which rely on truncating the probability distribution of tokens- especially Nucleus sampling can produce coherent and diverse texts close to human-generated texts.

== References ==
[1]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018

The Curious Case of Degeneration

2020-11-11T17:41:31Z

A4moayye: /* Language Model Decoding */

== Presented by ==
Donya Hamzeian
== Introduction ==
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results.
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]

As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements.

The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts

Some may raise this question that the problem with beam-search may be due to search error i.e. they are more probable phrases that beam search is unable to find, but the point is that natural language has lower per-token probability on average and people usually optimize against saying the obvious.

The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have high variance in likelihood leading to incoherency issues.
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.

==Language Model Decoding==
There are two types of generation tasks.

1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Because of this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.

2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks, and in fact, they are the focus of this paper.

The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.
====Nucleus Sampling====
This decoding strategy is indeed truncating the long tail of the probability distribution. In order to do that, first, we need to find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. Then set <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math> and rescale the probability distribution with <math>p'</math> and select the tokens from <math>P'</math>.
<math>
P'(x|x_{1:i-1}) = \begin{cases}
\frac{P(x|x_{1:i-1})}{p'}, & \mbox{if } x \in V^{(p)} \\
0 & \mbox{if } otherwise
\end{cases}

</math>

====Top-k Sampling====
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size k <math>V^{(k)} </math> which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling.

====Sampling with Temperature====
The probability of tokens will be calculated according to the equation below where 0<t<1 and <math>u_{1:|V|} </math> are logits. Recent studies have shown that lowering t improves the quality of the generated texts while it decreases diversity.

<math>
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}
</math>

==Likelihood Evaluation==
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.

====Perplexity====

This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts.

[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]

==Distributional Statistical Evaluation==
====Zipf Distribution Analysis====
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]

====Self BLEU====
The Self-BLEU score is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling.

[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]

==Conclusion==
In this paper, different decoding strategies were analyzed on open-ended generation tasks. They showed that likelihood maximization decoding causes degeneration where decoding strategies- which rely on truncating the probability distribution of tokens- especially Nucleus sampling can produce coherent and diverse texts close to human-generated texts.

The Curious Case of Degeneration

2020-11-11T17:32:21Z

A4moayye: /* Language Model Decoding */

== Presented by ==
Donya Hamzeian
== Introduction ==
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results.
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]

As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements.

The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts

Some may raise this question that the problem with beam-search may be due to search error i.e. they are more probable phrases that beam search is unable to find, but the point is that natural language has lower per-token probability on average and people usually optimize against saying the obvious.

The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have high variance in likelihood leading to incoherency issues.
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.

==Language Model Decoding==
There are two types of generation tasks.

1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Because of this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.

2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks, and in fact, they are the focus of this paper.

The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.
====Nucleus Sampling====
This decoding strategy is indeed truncating the long tail of the probability distribution. In order to do that, first, we need to find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. Then set p'= <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math> and rescale the probability distribution with p' and select the tokens from P'.
<math>
P'(x|x_{1:i-1} = \begin{cases}
\frac{P(x|x_{1:i-1})}{p'}, & \mbox{if } x \in V^{(p)} \\
0 & \mbox{if } otherwise
\end{cases}

</math>

====Top-k Sampling====
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size k <math>V^{(k)} </math> which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling.

====Sampling with Temperature====
The probability of tokens will be calculated according to the equation below where 0<t<1 and <math>u_{1:|V|} </math> are logits. Recent studies have shown that lowering t improves the quality of the generated texts while it decreases diversity.

<math>
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}
</math>

==Likelihood Evaluation==
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.

====Perplexity====

This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts.

[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]

==Distributional Statistical Evaluation==
====Zipf Distribution Analysis====
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]

====Self BLEU====
The Self-BLEU score is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling.

[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]

==Conclusion==
In this paper, different decoding strategies were analyzed on open-ended generation tasks. They showed that likelihood maximization decoding causes degeneration where decoding strategies- which rely on truncating the probability distribution of tokens- especially Nucleus sampling can produce coherent and diverse texts close to human-generated texts.

stat940F21

2020-11-08T13:11:20Z

A4moayye: /* Paper presentation */

== [[F20-STAT 946-Proposal| Project Proposal ]] ==

= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=

Use the following notations:

P: You have written a summary/critique on the paper.

T: You had a technical contribution on a paper (excluding the paper that you present).

E: You had an editorial contribution on a paper (excluding the paper that you present).

=Paper presentation=
{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="60pt"|Date
|width="100pt"|Name
|width="30pt"|Paper number
|width="700pt"|Title
|width="30pt"|Link to the paper
|width="30pt"|Link to the summary
|width="30pt"|Link to the video
|-
|-
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]
|-
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]]
|-
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]
|-
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||
|-
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]
|-
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video]
|-
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]
|-
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||
|-
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||
|-
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] || Learn
|-
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || ||
|-
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||
|-
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || ||
|-
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||
|-
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||
|-
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||
|-
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||
|-
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||
|-
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||
|-
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||
|-
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||
|-
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||
|-
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||
|-
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||
|-
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||
|-
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || |-
|-
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||
|-
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||
|-
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||
|-
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||
|-
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||
|-
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||
|-
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||

stat940F21

2020-11-08T13:09:35Z

A4moayye: /* Paper presentation */

== [[F20-STAT 946-Proposal| Project Proposal ]] ==

= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=

Use the following notations:

P: You have written a summary/critique on the paper.

T: You had a technical contribution on a paper (excluding the paper that you present).

E: You had an editorial contribution on a paper (excluding the paper that you present).

=Paper presentation=
{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="60pt"|Date
|width="100pt"|Name
|width="30pt"|Paper number
|width="700pt"|Title
|width="30pt"|Link to the paper
|width="30pt"|Link to the summary
|width="30pt"|Link to the video
|-
|-
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]
|-
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]]
|-
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]
|-
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||
|-
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]
|-
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video]
|-
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]
|-
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||
|-
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||
|-
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] ||
|-
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || ||
|-
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||
|-
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || ||
|-
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||
|-
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||
|-
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||
|-
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||
|-
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||
|-
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||
|-
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||
|-
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||
|-
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||
|-
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||
|-
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||
|-
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||
|-
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || |-
|-
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||
|-
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||
|-
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||
|-
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||
|-
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||
|-
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||
|-
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||

stat940F21

2020-11-08T13:09:01Z

A4moayye: /* Paper presentation */

== [[F20-STAT 946-Proposal| Project Proposal ]] ==

= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=

Use the following notations:

P: You have written a summary/critique on the paper.

T: You had a technical contribution on a paper (excluding the paper that you present).

E: You had an editorial contribution on a paper (excluding the paper that you present).

=Paper presentation=
{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="60pt"|Date
|width="100pt"|Name
|width="30pt"|Paper number
|width="700pt"|Title
|width="30pt"|Link to the paper
|width="30pt"|Link to the summary
|width="30pt"|Link to the video
|-
|-
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]
|-
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]]
|-
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]
|-
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||
|-
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]
|-
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video]
|-
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]
|-
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||
|-
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||
|-
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F] ||
|-
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || ||
|-
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||
|-
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || ||
|-
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||
|-
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||
|-
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||
|-
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||
|-
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||
|-
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||
|-
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||
|-
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||
|-
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||
|-
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||
|-
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||
|-
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||
|-
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || |-
|-
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||
|-
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||
|-
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||
|-
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||
|-
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||
|-
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||
|-
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||

stat940F21

2020-11-08T13:08:31Z

A4moayye: /* Paper presentation */

== [[F20-STAT 946-Proposal| Project Proposal ]] ==

= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=

Use the following notations:

P: You have written a summary/critique on the paper.

T: You had a technical contribution on a paper (excluding the paper that you present).

E: You had an editorial contribution on a paper (excluding the paper that you present).

=Paper presentation=
{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="60pt"|Date
|width="100pt"|Name
|width="30pt"|Paper number
|width="700pt"|Title
|width="30pt"|Link to the paper
|width="30pt"|Link to the summary
|width="30pt"|Link to the video
|-
|-
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]
|-
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]]
|-
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]
|-
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||
|-
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]
|-
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video]
|-
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]
|-
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||
|-
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||
|-
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F ||
|-
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || ||
|-
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||
|-
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || ||
|-
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||
|-
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||
|-
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||
|-
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||
|-
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||
|-
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||
|-
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||
|-
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||
|-
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||
|-
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||
|-
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||
|-
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||
|-
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || |-
|-
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||
|-
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||
|-
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||
|-
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||
|-
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||
|-
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||
|-
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||

When Does Self-Supervision Improve Few-Shot Learning?

2020-11-08T13:07:25Z

A4moayye:

== Presented by ==
Arash Moayyedi

== Introduction ==
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.

== Previous Work ==
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many different few-shot learning methods currently exist, among which, this paper focuses on Prototypical Networks or ProtoNets[1] for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML)[2].

The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabeled data, while labeling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.

== Method ==
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.

[[File:arash1.JPG |center|800px]]

<div align="center">Figure 1: Suggested architecture.</div>

The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:

<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math>

<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math>

The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.

== Experiments ==
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircrafts, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.

Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle[3]. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.

== Results ==
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because of the fact that flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.

[[File:arash2.JPG |center|800px]]

<div align="center">Figure 2: Benefits of SSL for few-shot learning tasks.</div>

In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low resolution images, which make the classification harder for natural and man-made objects, respectively.

[[File:arash3.JPG |center|800px]]

<div align="center">Figure 3: Benefits of SSL for harder few-shot learning tasks.</div>

Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.

[[File:arash4.JPG |center|800px]]

<div align="center">Figure 4: Performance on few-shot learning using different meta-learners.</div>

In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percent of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often than increasing the size, but shifting the domain. This is shown az crosses on the chart.

[[File:arash5.JPG |center|800px]]

<div align="center">Figure 5: (a) Effect of number of images on SSL. (b) Effct of domain shift on SSL.</div>

== Conclusion ==
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also shown that the dataset used for SSL should not necessarily be large. Increase the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.

== Critiques ==
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains, while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.

[[File:arash6.JPG |center|800px]]

<div align="center">Figure 6: Comparison with prior works on mini_ImageNet.</div>

== References ==

[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)

[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)

[3]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)

When Does Self-Supervision Improve Few-Shot Learning?

2020-11-08T10:59:13Z

A4moayye: /* Critiques */

== Presented by ==
Arash Moayyedi

== Introduction ==
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.

== Previous Work ==
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many different few-shot learning methods currently exist, among which, this paper focuses on Prototypical Networks or ProtoNets for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML).

The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabeled data, while labeling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.

== Method ==
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.

[[File:arash1.JPG |center|800px]]

<div align="center">Figure 1: Suggested architecture.</div>

The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:

<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math>

<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math>

The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.

== Experiments ==
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircrafts, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.

Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.

== Results ==
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because of the fact that flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.

[[File:arash2.JPG |center|800px]]

<div align="center">Figure 2: Benefits of SSL for few-shot learning tasks.</div>

In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low resolution images, which make the classification harder for natural and man-made objects, respectively.

[[File:arash3.JPG |center|800px]]

<div align="center">Figure 3: Benefits of SSL for harder few-shot learning tasks.</div>

Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.

[[File:arash4.JPG |center|800px]]

<div align="center">Figure 4: Performance on few-shot learning using different meta-learners.</div>

In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percent of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often than increasing the size, but shifting the domain. This is shown az crosses on the chart.

[[File:arash5.JPG |center|800px]]

<div align="center">Figure 5: (a) Effect of number of images on SSL. (b) Effct of domain shift on SSL.</div>

== Conclusion ==
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also shown that the dataset used for SSL should not necessarily be large. Increase the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.

== Critiques ==
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains, while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.

[[File:arash6.JPG |center|800px]]

<div align="center">Figure 6: Comparison with prior works on mini_ImageNet.</div>

When Does Self-Supervision Improve Few-Shot Learning?

2020-11-08T08:15:37Z

A4moayye:

== Presented by ==
Arash Moayyedi

== Introduction ==
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.

== Previous Work ==
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many different few-shot learning methods currently exist, among which, this paper focuses on Prototypical Networks or ProtoNets for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML).

The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabeled data, while labeling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.

== Method ==
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.

[[File:arash1.JPG |center|800px]]

<div align="center">Figure 1: Suggested architecture.</div>

The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:

<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math>

<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math>

The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.

== Experiments ==
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircrafts, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.

Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.

== Results ==
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because of the fact that flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.

[[File:arash2.JPG |center|800px]]

<div align="center">Figure 2: Benefits of SSL for few-shot learning tasks.</div>

In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low resolution images, which make the classification harder for natural and man-made objects, respectively.

[[File:arash3.JPG |center|800px]]

<div align="center">Figure 3: Benefits of SSL for harder few-shot learning tasks.</div>

Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.

[[File:arash4.JPG |center|800px]]

<div align="center">Figure 4: Performance on few-shot learning using different meta-learners.</div>

In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percent of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often than increasing the size, but shifting the domain. This is shown az crosses on the chart.

[[File:arash5.JPG |center|800px]]

<div align="center">Figure 5: (a) Effect of number of images on SSL. (b) Effct of domain shift on SSL.</div>

== Conclusion ==
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also shown that the dataset used for SSL should not necessarily be large. Increase the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.

== Critiques ==
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.

[[File:arash6.JPG |center|800px]]

<div align="center">Figure 6: Comparison with prior works on mini_ImageNet.</div>

File:arash6.JPG

2020-11-08T08:15:13Z

A4moayye:

When Does Self-Supervision Improve Few-Shot Learning?

2020-11-08T08:13:58Z

A4moayye:

== Presented by ==
Arash Moayyedi

== Introduction ==
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.

== Previous Work ==
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many different few-shot learning methods currently exist, among which, this paper focuses on Prototypical Networks or ProtoNets for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML).

The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabeled data, while labeling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.

== Method ==
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.

[[File:arash1.JPG |center|800px]]

<div align="center">Figure 1: Suggested architecture.</div>

The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:

<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math>

<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math>

The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.

== Experiments ==
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircrafts, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.

Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.

== Results ==
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because of the fact that flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.

[[File:arash2.JPG |center|800px]]

<div align="center">Figure 2: Benefits of SSL for few-shot learning tasks.</div>

In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low resolution images, which make the classification harder for natural and man-made objects, respectively.

[[File:arash3.JPG |center|800px]]

<div align="center">Figure 3: Benefits of SSL for harder few-shot learning tasks.</div>

Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.

[[File:arash4.JPG |center|800px]]

<div align="center">Figure 4: Performance on few-shot learning using different meta-learners.</div>

In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percent of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often than increasing the size, but shifting the domain. This is shown az crosses on the chart.

[[File:arash5.JPG |center|800px]]

<div align="center">Figure 5: (a) Effect of number of images on SSL. (b) Effct of domain shift on SSL.</div>

== Conclusion ==
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also shown that the dataset used for SSL should not necessarily be large. Increase the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.

== Critiques ==
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Moreover, comparing their work with previous works, we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a $84\times84$ image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.

When Does Self-Supervision Improve Few-Shot Learning?

2020-11-08T08:12:58Z

A4moayye:

== Presented by ==
Arash Moayyedi

== Introduction ==
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.

== Previous Work ==
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many different few-shot learning methods currently exist, among which, this paper focuses on Prototypical Networks or ProtoNets for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML).

The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabeled data, while labeling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.

== Method ==
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.

[[File:arash1.JPG |center|800px]]

<div align="center">Figure 1: Suggested architecture.</div>

The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:

<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math>

<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math>

The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.

== Experiments ==
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircrafts, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.

Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.

== Results ==
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because of the fact that flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.

[[File:arash2.JPG |center|800px]]

<div align="center">Figure 2: Benefits of SSL for few-shot learning tasks.</div>

In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low resolution images, which make the classification harder for natural and man-made objects, respectively.

[[File:arash3.JPG |center|800px]]

<div align="center">Figure 3: Benefits of SSL for harder few-shot learning tasks<./div>

Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.

[[File:arash4.JPG |center|800px]]

<div align="center">Figure 4: Performance on few-shot learning using different meta-learners.</div>

In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percent of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often than increasing the size, but shifting the domain. This is shown az crosses on the chart.

[[File:arash5.JPG |center|800px]]

<div align="center">Figure 5: (a) Effect of number of images on SSL. (b) Effct of domain shift on SSL.</div>

== Conclusion ==
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also shown that the dataset used for SSL should not necessarily be large. Increase the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.

== Critiques ==
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Moreover, comparing their work with previous works, we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a $84\times84$ image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.

When Does Self-Supervision Improve Few-Shot Learning?

2020-11-08T08:12:03Z

A4moayye:

== Presented by ==
Arash Moayyedi

== Introduction ==
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.

== Previous Work ==
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many different few-shot learning methods currently exist, among which, this paper focuses on Prototypical Networks or ProtoNets for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML).

The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabeled data, while labeling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.

== Method ==
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.

[[File:arash1.JPG |center|1000px|Image: 1000 pixels]]

<div align="center">Figure 1: Suggested architecture.</div>

The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:

<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math>

<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math>

The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.

== Experiments ==
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircrafts, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.

Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.

== Results ==
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because of the fact that flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.

[[File:arash2.JPG |center|1000px|Image: 1000 pixels]]

<div align="center">Figure 2: Benefits of SSL for few-shot learning tasks.</div>

In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low resolution images, which make the classification harder for natural and man-made objects, respectively.

[[File:arash3.JPG |center|1000px|Image: 1000 pixels]]

<div align="center">Figure 3: Benefits of SSL for harder few-shot learning tasks<./div>

Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.

[[File:arash4.JPG |center|1000px|Image: 1000 pixels]]

<div align="center">Figure 4: Performance on few-shot learning using different meta-learners.</div>

In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percent of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often than increasing the size, but shifting the domain. This is shown az crosses on the chart.

[[File:arash5.JPG |center|1000px|Image: 1000 pixels]]

<div align="center">Figure 5: (a) Effect of number of images on SSL. (b) Effct of domain shift on SSL.</div>

== Conclusion ==
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also shown that the dataset used for SSL should not necessarily be large. Increase the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.

== Critiques ==
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Moreover, comparing their work with previous works, we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a $84\times84$ image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.

When Does Self-Supervision Improve Few-Shot Learning?

2020-11-08T08:11:19Z

A4moayye:

== Presented by ==
Arash Moayyedi

== Introduction ==
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.

== Previous Work ==
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many different few-shot learning methods currently exist, among which, this paper focuses on Prototypical Networks or ProtoNets for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML).

The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabeled data, while labeling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.

== Method ==
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.

[[File:arash1.JPG |center|1000px|Image: 1000 pixels]]

<div align="center">Figure 1: Suggested architecture.</div>

The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:

<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math>

<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math>

The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.

== Experiments ==
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircrafts, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.

Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.

== Results ==
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because of the fact that flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.

[[File:arash2.JPG |center|1000px|Image: 1000 pixels]]

<div align="center">Figure 2: Benefits of SSL for few-shot learning tasks.</div>

In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low resolution images, which make the classification harder for natural and man-made objects, respectively.

[[File:arash3.JPG |center|1000px|Image: 1000 pixels]]

<div align="center">Figure 3: Benefits of SSL for harder few-shot learning tasks<./div>

Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.

[[File:arash4.JPG |center|1000px|Image: 1000 pixels]]

<div align="center">Figure 4: Performance on few-shot learning using different meta-learners.</div>

In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percent of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often than increasing the size, but shifting the domain. This is shown az crosses on the chart.

[[File:arash5.JPG |center|1000px|Image: 1000 pixels]]

<div align="center">Figure 5: (a) Effect of number of images on SSL. (b) Effct of domain shift on SSL.</div>

== Conclusion ==
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also shown that the dataset used for SSL should not necessarily be large. Increase the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.

== Critiques ==
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Moreover, comparing their work with previous works, we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a $84\times84$ image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.

When Does Self-Supervision Improve Few-Shot Learning?

2020-11-08T08:06:37Z

A4moayye:

== Presented by ==
Arash Moayyedi

== Introduction ==
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.

== Previous Work ==
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many different few-shot learning methods currently exist, among which, this paper focuses on Prototypical Networks or ProtoNets for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML).

The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabeled data, while labeling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.

== Method ==
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.

[[File:arash1.JPG |center|1000px|Image: 1000 pixels]]
<div align="center">Figure 1: Suggested architecture.</div>

The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:

<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math>

<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math>

The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.

== Experiments ==
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircrafts, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.

Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.

== Results ==
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because of the fact that flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.

[[File:arash2.JPG |center|1000px|Image: 1000 pixels]]
<div align="center">Figure 2: Benefits of SSL for few-shot learning tasks.</div>

In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low resolution images, which make the classification harder for natural and man-made objects, respectively.

[[File:arash3.JPG |center|1000px|Image: 1000 pixels]]
<div align="center">Figure 3: Benefits of SSL for harder few-shot learning tasks<./div>

Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.

[[File:arash4.JPG |center|1000px|Image: 1000 pixels]]
<div align="center">Figure 4: Performance on few-shot learning using different meta-learners.</div>

In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percent of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often than increasing the size, but shifting the domain. This is shown az crosses on the chart.

[[File:arash5.JPG |center|1000px|Image: 1000 pixels]]
<div align="center">Figure 4: Performance on few-shot learning using different meta-learners.</div>

== Conclusion ==
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also shown that the dataset used for SSL should not necessarily be large. Increase the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.

== Critiques ==
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Moreover, comparing their work with previous works, we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a $84\times84$ image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.

File:arash5.JPG

2020-11-08T07:14:50Z

A4moayye:

File:arash4.JPG

2020-11-08T07:14:38Z

A4moayye:

File:arash3.JPG

2020-11-08T07:10:19Z

A4moayye:

When Does Self-Supervision Improve Few-Shot Learning?

2020-11-08T07:07:39Z

A4moayye:

== Presented by ==
Arash Moayyedi

== Introduction ==
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.

== Previous Work ==
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many different few-shot learning methods currently exist, among which, this paper focuses on Prototypical Networks or ProtoNets for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML).

The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabeled data, while labeling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.

== Method ==
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.

[[File:arash1.JPG |center|1000px|Image: 1000 pixels]]
<div align="center">Figure 1: Suggested architecture</div>

The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:

<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math>

<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math>

The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.

== Experiments ==
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircrafts, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.

Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.

== Results ==
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because of the fact that flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.

[[File:arash2.JPG |center|1000px|Image: 1000 pixels]]
<div align="center">Figure 2: Benefits of SSL for few-shot learning tasks</div>

In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low resolution images, which make the classification harder for natural and man-made objects, respectively.

Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.

In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percent of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often than increasing the size, but shifting the domain. This is shown az crosses on the chart.

== Conclusion ==
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also shown that the dataset used for SSL should not necessarily be large. Increase the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.

== Critiques ==
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Moreover, comparing their work with previous works, we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a $84\times84$ image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.

File:arash2.JPG

2020-11-08T07:06:47Z

A4moayye:

When Does Self-Supervision Improve Few-Shot Learning?

2020-11-08T07:04:53Z

A4moayye:

== Presented by ==
Arash Moayyedi

== Introduction ==
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.

== Previous Work ==
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many different few-shot learning methods currently exist, among which, this paper focuses on Prototypical Networks or ProtoNets for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML).

The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabeled data, while labeling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.

== Method ==
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.

[[File:arash1.JPG |center|1000px|Image: 1000 pixels]]
<div align="center">Figure 1: Suggested architecture</div>

The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:

<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math>

<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math>

The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.

== Experiments ==
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircrafts, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.

Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.

== Results ==
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because of the fact that flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.

In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low resolution images, which make the classification harder for natural and man-made objects, respectively.

Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.

In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percent of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often than increasing the size, but shifting the domain. This is shown az crosses on the chart.

== Conclusion ==
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also shown that the dataset used for SSL should not necessarily be large. Increase the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.

== Critiques ==
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Moreover, comparing their work with previous works, we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a $84\times84$ image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.

When Does Self-Supervision Improve Few-Shot Learning?

2020-11-08T07:03:04Z

A4moayye:

== Presented by ==
Arash Moayyedi

== Introduction ==
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.

== Previous Work ==
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many different few-shot learning methods currently exist, among which, this paper focuses on Prototypical Networks or ProtoNets for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML).

The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabeled data, while labeling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.

== Method ==
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.

[[File:arash1.JPG | center]]
<div align="center", style="height: 250px">Figure 1: Suggested architecture</div>

The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:

<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math>

<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math>

The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.

== Experiments ==
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircrafts, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.

Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.

== Results ==
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because of the fact that flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.

In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low resolution images, which make the classification harder for natural and man-made objects, respectively.

Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.

In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percent of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often than increasing the size, but shifting the domain. This is shown az crosses on the chart.

== Conclusion ==
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also shown that the dataset used for SSL should not necessarily be large. Increase the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.

== Critiques ==
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Moreover, comparing their work with previous works, we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a $84\times84$ image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.

When Does Self-Supervision Improve Few-Shot Learning?

2020-11-08T07:00:18Z

A4moayye:

== Presented by ==
Arash Moayyedi

== Introduction ==
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.

== Previous Work ==
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many different few-shot learning methods currently exist, among which, this paper focuses on Prototypical Networks or ProtoNets for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML).

The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabeled data, while labeling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.

== Method ==
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.

[[File:arash1.JPG | center]]
<div align="center">Figure 1: Suggested architecture</div>

The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:

<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math>

<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math>

The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.

== Experiments ==
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircrafts, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.

Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.

== Results ==
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because of the fact that flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.

In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low resolution images, which make the classification harder for natural and man-made objects, respectively.

Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.

In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percent of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often than increasing the size, but shifting the domain. This is shown az crosses on the chart.

== Conclusion ==
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also shown that the dataset used for SSL should not necessarily be large. Increase the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.

== Critiques ==
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Moreover, comparing their work with previous works, we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a $84\times84$ image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.

When Does Self-Supervision Improve Few-Shot Learning?

2020-11-08T06:59:36Z

A4moayye:

== Presented by ==
Arash Moayyedi

== Introduction ==
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.

== Previous Work ==
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many different few-shot learning methods currently exist, among which, this paper focuses on Prototypical Networks or ProtoNets for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML).

The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabeled data, while labeling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.

== Method ==
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.

[[File:arash1.JPG | center]]

The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:

<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math>

<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math>

The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.

== Experiments ==
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircrafts, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.

Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.

== Results ==
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because of the fact that flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.

In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low resolution images, which make the classification harder for natural and man-made objects, respectively.

Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.

In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percent of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often than increasing the size, but shifting the domain. This is shown az crosses on the chart.

== Conclusion ==
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also shown that the dataset used for SSL should not necessarily be large. Increase the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.

== Critiques ==
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Moreover, comparing their work with previous works, we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a $84\times84$ image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.

File:arash1.JPG

2020-11-08T06:57:55Z

A4moayye: A4moayye uploaded a new version of File:arash1.JPG

File:arash1.JPG

2020-11-08T06:57:04Z

A4moayye:

File:Architecture.JPG

2020-11-08T06:49:28Z

A4moayye:

When Does Self-Supervision Improve Few-Shot Learning?

2020-11-08T06:33:14Z

A4moayye:

== Presented by ==
Arash Moayyedi

== Introduction ==
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.

== Previous Work ==
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many different few-shot learning methods currently exist, among which, this paper focuses on Prototypical Networks or ProtoNets for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML).

The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabeled data, while labeling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.

== Method ==
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.

The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:

<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math>

<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math>

The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.

== Experiments ==
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircrafts, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.

Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.

== Results ==
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because of the fact that flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.

In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low resolution images, which make the classification harder for natural and man-made objects, respectively.

Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.

In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percent of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often than increasing the size, but shifting the domain. This is shown az crosses on the chart.

== Conclusion ==
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also shown that the dataset used for SSL should not necessarily be large. Increase the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.

== Critiques ==
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Moreover, comparing their work with previous works, we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a $84\times84$ image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.

When Does Self-Supervision Improve Few-Shot Learning?

2020-11-08T06:19:04Z

A4moayye:

When Does Self-Supervision Improve Few-Shot Learning?

2020-11-08T06:18:41Z

A4moayye: Undo revision 43383 by A4moayye (talk)