== Presented by ==
Sobhan Hemati

== Introduction ==
<br />
Generative adversarial networks (GANs) are among the most important generative models: a discriminator and a generator compete with each other to solve a minimax game. Based on the original GAN paper, when training is finished and a Nash Equilibrium is reached, the discriminator is nothing but a constant function that assigns a score of 0.5 everywhere. This means that, in this setting, the discriminator is nothing more than a tool for training the generator. Furthermore, the generator in a traditional GAN models the data density implicitly, while in some applications we need an explicit generative model of the data. Recently, it has been shown that training an energy-based model (EBM) with a parameterized variational distribution is also a minimax game, similar to the one in GANs. An advantage of this EBM view is that, unlike the original GAN formulation, the discriminator itself is an explicit density model of the data.
<br />
Building on this observation, the authors show that an energy-based model can be trained using a minimax formulation similar to that of GANs. After training the energy-based model, they use the Fisher Score and Fisher Information (which are calculated from the derivative of the generative model w.r.t. its parameters) to evaluate the power of the discriminator in modeling the data distribution. More precisely, they calculate normalized Fisher Vectors and a Fisher Distance measure using the derivative of the discriminator, and use these to estimate similarities both between individual data samples and between sets of samples. They name the derived representations Adversarial Fisher Vectors (AFVs). The Fisher vector is a powerful representation that can be computed from EBMs thanks to the fact that, in this EBM view, the discriminator itself is an explicit density model of the data. Fisher vectors can also be used for set-representation problems, which are challenging in general; as we will see, the Fisher kernel lets us calculate the distance between two sets of images, which is not a trivial task. The authors find several applications and attractive characteristics for AFVs as pre-trained features, such as:
<br />
* State-of-the-art performance for unsupervised feature extraction and linear classification tasks.
* Using the similarity function induced by the learned density model as a perceptual metric that correlates well with human judgments.
* Improved training of GANs through monitoring (AFV metrics) and stability (MCMC updates), which is a difficult task in general.
* Using AFV to estimate the distance between sets, which allows monitoring the training process. More precisely, the Fisher Distance between the set of validation examples and the set of generated examples can effectively capture the existence of overfitting.
<br />
== Background ==
===Generative Adversarial Networks===
GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator, which we train to generate new examples, and the discriminator, which tries to classify examples as either real (from the domain) or fake (generated). The weights of the generator and the discriminator are updated by solving the following optimization problem:
\begin{equation}
\underset{G}{\text{max}} \ \underset{D}{\text{min}} \ E_{\mathbf{x} \sim p_{data}(\mathbf{x})}[-\log D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[-\log (1-D(G(\mathbf{z})))]
\tag{1}
\label{1}
\end{equation}
<br />
where <math> p_{data}(\mathbf{x}) </math>, <math> D(\mathbf{x}) </math>, and <math> G(\mathbf{z}) </math> denote the data distribution, the discriminator, and the generator, respectively. To optimize the above problem, in the inner loop <math> D </math> is trained until convergence given <math> G </math>, and in the outer loop <math> G </math> is updated one step given <math> D </math>.
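
To make the alternating optimization concrete, the following is a minimal PyTorch-style sketch of one training step under Eq. \ref{1}; this is not the authors' code, <code>D</code>, <code>G</code>, the optimizers, and the data batch are assumed to be defined elsewhere, and the generator uses the common non-saturating variant of its loss.
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def gan_step(D, G, real_x, opt_d, opt_g, z_dim=128, d_steps=1):
    # Inner loop: train D (a fixed number of steps stands in for "until convergence").
    for _ in range(d_steps):
        z = torch.randn(real_x.size(0), z_dim)
        fake_x = G(z).detach()                    # block gradients into G
        d_real, d_fake = D(real_x), D(fake_x)     # D is assumed to output probabilities in (0, 1)
        d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
                 F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Outer loop: update G one step against the fixed D (non-saturating generator loss).
    z = torch.randn(real_x.size(0), z_dim)
    d_on_fake = D(G(z))
    g_loss = F.binary_cross_entropy(d_on_fake, torch.ones_like(d_on_fake))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
</syntaxhighlight>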
<br />
===GANs as variational training of deep EBMs===
An energy-based model (EBM) is a generative model that defines an (unnormalized) density over the data through an energy function, assigning low energy to likely examples and high energy to unlikely ones; sampling from it yields new data with a distribution similar to that of the training set. Let an energy-based model define a density function <math> p_{E}(\mathbf{x}) </math> as <math> \frac{e^{-E(\mathbf{x})}}{ \int_{\mathbf{x}} e^{-E(\mathbf{x})} \,d\mathbf{x} } </math>. Then, the negative log-likelihood (NLL) of <math> p_{E}(\mathbf{x}) </math> can be written as
<br />
\begin{equation}
E_{\mathbf{x} \sim p_{data}(\mathbf{x})}[E(\mathbf{x})]+ \log \int_{\mathbf{x}} q(\mathbf{x}) \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}\,d\mathbf{x} =
E_{\mathbf{x} \sim p_{data}(\mathbf{x})}[E(\mathbf{x})]+ \log E_{\mathbf{x} \sim q(\mathbf{x})}[\frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}] \geq \\
E_{\mathbf{x} \sim p_{data}(\mathbf{x})}[E(\mathbf{x})]+ E_{\mathbf{x} \sim q(\mathbf{x})}[\log \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}]= E_{\mathbf{x} \sim p_{data}(\mathbf{x})}[E(\mathbf{x})]- E_{\mathbf{x} \sim q(\mathbf{x})}[E(\mathbf{x})] + H(q)
\tag{2}
\label{2}
\end{equation}
<br />
where <math> q(\mathbf{x}) </math> is an auxiliary distribution, called the variational distribution, and <math>H(q) </math> denotes its entropy. Jensen’s inequality is used to obtain the variational lower bound on the NLL. This bound is tight if <math> q(\mathbf{x}) \propto e^{-E(\mathbf{x})} \ \forall \mathbf{x} </math>, which means <math> q(\mathbf{x}) = p_{E}(\mathbf{x}) </math>. In this case, if we set <math> D(\mathbf{x})= -E(\mathbf{x}) </math> and <math> q(\mathbf{x})= p_{G}(\mathbf{x}) </math>, Eq. \ref{2} turns into the following problem:
<br />
<br />
<br />
\begin{equation}
\underset{D}{\text{min}} \ \underset{G}{\text{max}} \ E_{\mathbf{x} \sim p_{data}(\mathbf{x})}[-D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[D(G(\mathbf{z}))] +H(p_{G})
\tag{3}
\label{3}
\end{equation}
<br />
<br />
where the variational lower bound is maximized w.r.t. <math> p_{G}</math>; the energy model is then updated one step to decrease the NLL with the optimal <math> p_{G}</math> (see Figure 1). [[File:Fig1.png]]

Equations \ref{3} and \ref{1} are similar in the sense that both take the form of a minimax game between <math> D </math> and <math> G </math>. However, there are three major differences (a code sketch contrasting the two objectives follows the list below):
<br />
*The entropy regularization term <math> H(p_{G})</math> in Eq. \ref{3} prevents the generator from collapsing (although in practice it is difficult to come up with a differentiable approximation to the entropy term <math> H(p_{G})</math>, so heuristic regularization methods such as Batch Normalization are used instead).
* The order of optimizing <math> D </math> and <math> G </math> is different.
* More importantly, <math> D </math> is a density model for the data distribution and <math> G </math> learns to sample from <math> D </math>.
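
To highlight the difference in practice, here is a hedged sketch (not the authors' implementation) of one EBM-GAN update following Eq. \ref{3}, where <math> D </math> plays the role of a negative energy rather than a probability; <code>D</code>, <code>G</code>, and the optimizers are placeholders, and the entropy term is left as an unspecified estimate since it has no simple differentiable form.
<syntaxhighlight lang="python">
import torch

def ebm_gan_step(D, G, real_x, opt_d, opt_g, z_dim=128, entropy_bonus=None):
    # Generator step (inner max over G): maximize E_z[D(G(z))] + H(p_G).
    z = torch.randn(real_x.size(0), z_dim)
    g_loss = -D(G(z)).mean()
    if entropy_bonus is not None:            # placeholder for a heuristic entropy estimate
        g_loss = g_loss - entropy_bonus(G, z)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    # Energy-model step (outer min over D): minimize E_data[-D(x)] + E_z[D(G(z))],
    # i.e. push the score up on real data and down on generated data.
    z = torch.randn(real_x.size(0), z_dim)
    d_loss = -D(real_x).mean() + D(G(z).detach()).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    return d_loss.item(), g_loss.item()
</syntaxhighlight>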
<br />
== Methodology==
===Adversarial Fisher Vectors===
As mentioned, one of the most important advantages of an EBM GAN over a traditional one is that the discriminator is a dual form of the generator. This means that the discriminator defines a distribution that matches the training data. There is a straightforward way to evaluate the quality of the generator, namely inspecting the quality of the produced samples; however, for the discriminator it is not clear how to evaluate or use a model trained in this minimax scheme. To evaluate and also employ the discriminator of the GAN, the authors propose to use the theory of Fisher Information. This theory was originally proposed with the motivation of making connections between the two different types of models used in machine learning, i.e., generative and discriminative models. Given a density model <math> p_{\theta}(\mathbf{x})</math>, where <math> \mathbf{x} \in R^d </math> is the input and <math> \theta </math> are the model parameters, the Fisher Score of an example <math> \mathbf{x} </math> is defined as <math> U_\mathbf{x}=\nabla_{\theta} \log p_{\theta}(\mathbf{x}) </math>. This gradient maps an example <math> \mathbf{x} </math> into a feature vector that is a point in the gradient space of the model manifold. Intuitively, the gradient <math> U_\mathbf{x} </math> gives the direction of steepest ascent in <math> \log p_{\theta}(\mathbf{x}) </math> for the example <math> \mathbf{x} </math> along the manifold. In other words, the Fisher Score encodes the desired change of model parameters to better fit the example <math> \mathbf{x} </math>. The Fisher Information is defined as <math> I=E_{\mathbf{x} \sim p_{\theta}(\mathbf{x})} [U_\mathbf{x} U_\mathbf{x}^T]</math>. Having the Fisher Information and Fisher Score, one can map an example <math> \mathbf{x} </math> from feature space to the model space, and measure the proximity between two examples <math> \mathbf{x} </math> and <math> \mathbf{y} </math> by <math> U_\mathbf{x}^T I^{-1} U_\mathbf{y}</math>. The metric distance based on this proximity, <math> (U_\mathbf{x}-U_\mathbf{y})^T I^{-1} (U_\mathbf{x}-U_\mathbf{y})</math>, is called the Fisher Distance and can easily be generalized to measure the distance between two sets. Finally, the Adversarial Fisher Vector (AFV) is defined as
<br />
<br />
\begin{equation}
V_\mathbf{x}=I^{-\frac{1}{2}}U_\mathbf{x}
\end{equation}
<br />
As a result, the Fisher Distance is equivalent to the (squared) Euclidean distance between AFVs. Fisher vector theory has traditionally been applied with simple generative models such as Gaussian mixture models (GMMs).
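
As a toy illustration of these definitions (independent of GANs), the snippet below computes the Fisher Score, a Monte-Carlo estimate of the Fisher Information, and the Fisher Distance for a simple one-dimensional Gaussian density; it is only meant to make the formulas concrete.
<syntaxhighlight lang="python">
import math
import torch

# A 1-D Gaussian density p_theta(x) with parameters theta = (mu, log_sigma).
mu = torch.tensor(0.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)
params = [mu, log_sigma]

def log_prob(x):
    sigma = log_sigma.exp()
    return -0.5 * ((x - mu) / sigma) ** 2 - log_sigma - 0.5 * math.log(2 * math.pi)

def fisher_score(x):
    # U_x = gradient of log p_theta(x) w.r.t. the parameters, flattened into one vector.
    grads = torch.autograd.grad(log_prob(x), params)
    return torch.cat([g.reshape(-1) for g in grads])

# Monte-Carlo estimate of the Fisher Information I = E_{x ~ p_theta}[U_x U_x^T].
samples = mu.detach() + log_sigma.detach().exp() * torch.randn(1000)
U = torch.stack([fisher_score(x) for x in samples])      # shape (1000, 2)
I = U.t() @ U / U.shape[0]

def fisher_distance(x, y):
    d = fisher_score(x) - fisher_score(y)
    return d @ torch.linalg.solve(I, d)                  # (U_x - U_y)^T I^{-1} (U_x - U_y)

print(fisher_distance(torch.tensor(0.5), torch.tensor(2.0)))
</syntaxhighlight>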
In the domain of EBMs, where the density model is parameterized as <math> p_\theta(\mathbf{x})= \frac{e^{D(\mathbf{x};\theta)}}{\int_{\mathbf{x}} e^{D(\mathbf{x};\theta)} \,d\mathbf{x}} </math> and <math> \theta </math> are the parameters of <math> D</math>, the Fisher Score is derived as
<br />
<br />
<br />
<br />
\begin{equation}
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - \nabla_{\theta} \log \int_{\mathbf{x}} e^{D(\mathbf{x};\theta)} \,d\mathbf{x}= \nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{x} \sim p_\theta(\mathbf{x})} \nabla_{\theta} D(\mathbf{x};\theta).
\tag{4}
\label{4}
\end{equation}
As we know, in an EBM GAN the generator is updated during training to match the distribution <math> p_G(\mathbf{x}) </math> to <math> p_\theta(\mathbf{x})</math>. This allows us to approximate the second term in Eq. \ref{4} by sampling from the generator's distribution, which lets us compute the Fisher Score and Fisher Information in an EBM GAN as follows:
<br />
\begin{equation}
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{z} \sim p(\mathbf{z})} \nabla_{\theta} D(G(\mathbf{z});\theta), \quad I= E_{\mathbf{z} \sim p(\mathbf{z})}[U_{G(\mathbf{z})} U^T_{G(\mathbf{z})}]
\tag{5}
\label{5}
\end{equation}
<br />
Finally, having the Fisher Score and Fisher Information, the following approximation is used to calculate the AFV:
<br />
<br />
\begin{equation}
V_\mathbf{x}=\mbox{diag}(I)^{-\frac{1}{2}}U_\mathbf{x}
\tag{6}
\label{6}
\end{equation}
<br />
Remember that the Fisher Score transforms data from the feature space to the parameter space, which means that the dimensionality of the vectors can easily reach millions. Replacing <math> I </math> with <math>\mbox{diag}(I) </math> is therefore an attempt to reduce the computational load of calculating the final AFV.
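
Below is a hedged sketch of how Eqs. \ref{5} and \ref{6} could be implemented with automatic differentiation; it is not the authors' code, <code>D</code> and <code>G</code> stand for a trained EBM-GAN discriminator and generator, the explicit loop over generator samples is purely illustrative, and in practice the statistics would be accumulated incrementally because the parameter dimension can be in the millions.
<syntaxhighlight lang="python">
import torch

def flat_grad_D(D, x):
    # Gradient of the scalar D(x) w.r.t. all parameters of D, flattened into one vector.
    score = D(x.unsqueeze(0)).sum()
    grads = torch.autograd.grad(score, list(D.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def adversarial_fisher_vector(D, G, x, n_samples=256, z_dim=128, eps=1e-6):
    # Second term of Eq. (5): expected parameter gradient under generator samples.
    gen_grads = torch.stack([
        flat_grad_D(D, G(torch.randn(1, z_dim)).squeeze(0)) for _ in range(n_samples)
    ])                                                   # (n_samples, n_params)
    mean_gen_grad = gen_grads.mean(dim=0)

    U_x = flat_grad_D(D, x) - mean_gen_grad              # Fisher Score of x, Eq. (5)
    U_gen = gen_grads - mean_gen_grad                    # U_{G(z)} for each generated sample

    diag_I = (U_gen ** 2).mean(dim=0)                    # diagonal of I = E_z[U U^T]
    return U_x / torch.sqrt(diag_I + eps)                # AFV, Eq. (6)
</syntaxhighlight>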
<br />
===Generator update as stochastic gradient MCMC===
The use of a generator provides an efficient way of drawing samples from the EBM. However, in practice, great care needs to be taken to make sure that G is well conditioned to produce examples that cover enough modes of D. There is also a related issue where the parameters of G occasionally undergo sudden changes, generating samples that differ drastically from iteration to iteration, which contributes to training instability and lower model quality.

In light of these issues, the authors provide a different treatment of G, borrowing inspiration from the Markov chain Monte Carlo (MCMC) literature. MCMC variants have been widely studied in the context of EBMs, where they are used to sample from an unnormalized density and approximate the partition function. Stochastic gradient MCMC is of particular interest, as it utilizes the gradient of the log probability w.r.t. the input and performs gradient ascent to incrementally update the samples (while adding noise to the gradients); this technique has recently been applied to deep EBMs. The authors speculate that it is possible to train G to mimic the stochastic gradient MCMC update rule, such that the samples produced by G will approximate the true model distribution.
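
For intuition, the following is a sketch of the kind of stochastic gradient MCMC (Langevin-style) update on the inputs that the generator is interpreted as mimicking; step sizes and noise scales are illustrative only.
<syntaxhighlight lang="python">
import torch

def langevin_refine(D, x0, n_steps=10, step_size=0.01, noise_scale=0.01):
    # D is the negative energy, so gradient ascent on D(x) moves samples toward
    # low-energy (high-density) regions; the added noise keeps the chain stochastic.
    x = x0.clone().detach()
    for _ in range(n_steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(D(x).sum(), x)[0]
        x = (x + step_size * grad + noise_scale * torch.randn_like(x)).detach()
    return x
</syntaxhighlight>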
<br />
== Experiments ==
===Evaluating AFV representations===
As pointed out earlier, the main advantage of EBM GANs is their powerful discriminator, which is able to learn a density function that characterizes the data manifold of the training data. To evaluate how well the discriminator learns the data distribution, the authors proposed to use Fisher Information theory: they trained models under different settings, employed the discriminator to extract AFVs, and then used these vectors for a classification task with unsupervised pretraining.
Results in Table 1 suggest that AFVs achieve state-of-the-art performance among unsupervised pretraining methods on this classification task, and are also comparable with supervised learning.
<br />
[[File:Table1.png||center]]<br />
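
For reference, the linear-evaluation protocol behind results like Table 1 can be sketched as follows (a simplification, not the authors' exact setup): AFVs are extracted with a frozen EBM-GAN and a linear classifier is fit on top. Here <code>adversarial_fisher_vector</code> is the function sketched in the Methodology section, logistic regression stands in for any linear classifier, and the dataset variables are placeholders.
<syntaxhighlight lang="python">
import torch
from sklearn.linear_model import LogisticRegression

def extract_afvs(D, G, images):
    # images: iterable of image tensors; returns an (N, n_params) feature matrix.
    return torch.stack([adversarial_fisher_vector(D, G, x) for x in images]).numpy()

train_feats = extract_afvs(D, G, train_images)   # D, G: frozen trained EBM-GAN (placeholders)
test_feats = extract_afvs(D, G, test_images)

clf = LogisticRegression(max_iter=1000)          # simple stand-in for a linear classifier
clf.fit(train_feats, train_labels)
print("linear-probe test accuracy:", clf.score(test_feats, test_labels))
</syntaxhighlight>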
<br />
AFVs can also be used to measure the distance between sets of data points. The authors take advantage of this and calculate the semantic distance between classes (using all data points of every class) in CIFAR-10. As can be seen from Figure 2, although training was unsupervised, the semantic relations between classes are estimated well; for example, cars are similar to trucks and dogs are similar to cats.
<br />
[[File:Sobhan_Fig2.jpg||center]]<br />
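
A set-level comparison like the one in Figure 2 can be sketched by averaging the AFVs of each class and comparing the class means, for example by cosine similarity (the exact normalization used for the figure may differ); the feature and label tensors are placeholders.
<syntaxhighlight lang="python">
import torch

def class_mean_afvs(afvs, labels, n_classes=10):
    # afvs: (N, n_params) tensor of AFVs; labels: (N,) integer tensor of class labels.
    return torch.stack([afvs[labels == c].mean(dim=0) for c in range(n_classes)])

def class_similarity_matrix(afvs, labels, n_classes=10):
    means = class_mean_afvs(afvs, labels, n_classes)
    means = means / means.norm(dim=1, keepdim=True)
    return means @ means.t()        # (n_classes, n_classes) similarity between classes
</syntaxhighlight>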
<br />
<br />
As AFVs transform data from the feature space to the parameter space of the generative model, and as a result carry information about the data manifold, they are also expected to carry additional fine-grained perceptual information. To evaluate this, the authors ran experiments examining the usefulness of AFVs as a perceptual similarity metric consistent with human judgments. They used the AFV representation to calculate distances between image patches and compared with current methods on the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset, using the 2AFC and Just Noticeable Difference (JND) metrics. They trained a GAN on ImageNet and then calculated AFVs on the BAPPS evaluation set.
Table 2 shows the performance of AFV along with a variety of existing benchmarks. Clearly, AFV exceeds the reported unsupervised and self-supervised methods and is competitive with supervised methods trained on ImageNet.
<br />
[[File:Sobhan_Table2.png||center]]<br />
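
The 2AFC evaluation can be sketched as follows (a simplification of the BAPPS protocol, which in fact uses fractional human judgements): for each triplet, the candidate patch whose AFV is closer to the reference's AFV is predicted as the perceptually more similar one, and predictions are scored against the human choices.
<syntaxhighlight lang="python">
import torch

def two_afc_accuracy(D, G, triplets):
    # triplets: list of (reference, patch0, patch1, human_choice) with image tensors and a 0/1 label.
    correct = 0
    for ref, p0, p1, human_choice in triplets:
        v_ref = adversarial_fisher_vector(D, G, ref)
        v0 = adversarial_fisher_vector(D, G, p0)
        v1 = adversarial_fisher_vector(D, G, p1)
        pred = 0 if (v_ref - v0).norm() < (v_ref - v1).norm() else 1
        correct += int(pred == int(human_choice))
    return correct / len(triplets)
</syntaxhighlight>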
<br />
An interesting point about AFVs is their robustness to overfitting. The dimensionality of AFVs is three orders of magnitude higher than that of the existing methods, which would typically bring a higher propensity to overfit. However, AFVs still show great generalization ability, demonstrating that they indeed encode a meaningful low-dimensional subspace of the original data. Figure 6 visualizes the nearest neighbours.
<br />
[[File:Sobhan_Fig_6.png||center]]<br />
<br />
===Using the Fisher Distance to monitor training===
Training GANs has been a challenging task, partly because of the lack of reliable metrics. Although some domain-specific metrics such as the Inception Score and the Fréchet Inception Distance have recently been proposed, they mainly rely on a discriminative model trained on ImageNet, and thus have limited applicability to datasets that are drastically different. In this paper, the authors use the Fisher Distance between the set of real and the set of generated examples to monitor and diagnose the training process. To do this, they conducted a set of experiments on CIFAR-10, varying the number of training examples over the set {1000, 5000, 25000, 50000}. Figure 3 shows batch-wise estimates of the Inception Score and the "Fisher Similarity". It is clear that for larger numbers of training examples, the validation Fisher Similarity steadily increases, following a trend similar to the Inception Score. On the other hand, when the number of training examples is small, the validation Fisher Similarity starts decreasing at some point.
<br />
[[File:Sobhan_Fig_3.png||center]]<br />
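
One plausible way to compute such a monitoring signal is sketched below (the paper's exact definition of the Fisher Similarity may use a different normalization): compare the mean AFV of a validation batch with the mean AFV of a batch of generated samples.
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def fisher_similarity(D, G, val_images, n_gen=256, z_dim=128):
    v_val = torch.stack([adversarial_fisher_vector(D, G, x) for x in val_images]).mean(dim=0)
    gen_images = [G(torch.randn(1, z_dim)).squeeze(0).detach() for _ in range(n_gen)]
    v_gen = torch.stack([adversarial_fisher_vector(D, G, x) for x in gen_images]).mean(dim=0)
    return F.cosine_similarity(v_val, v_gen, dim=0)   # higher means generated set is closer to validation set
</syntaxhighlight>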
<br />
<br />
===Interpreting G update as parameterized MCMC===
AFVs can only be extracted if the generator approximates the EBM well during the training process. To examine this, a model is trained on ImageNet at 64x64 resolution, with the default architecture modified by adding residual blocks to the discriminator and the generator. The following figure shows training statistics over 80,000 iterations.
<br />
[[File:training 80K.png|600px|center]]<br />
<div align="center">Left: default generator objective. Right: corresponding Inception scores.</div><br />
<br />
== Conclusion ==
In this paper, the authors demonstrated that GANs can be reinterpreted in order to learn representations for a diverse set of tasks without requiring domain knowledge or annotated data. The authors also showed that in an EBM GAN, the discriminator can explicitly learn the data distribution and capture the intrinsic manifold of the data with a low error rate. This is markedly different from regular GANs, where the discriminator is reduced to a constant function once the Nash Equilibrium is reached. To evaluate how well the discriminator estimates the data distribution, the authors took advantage of Fisher Information theory. First, they showed that AFVs are a reliable indicator of whether GAN training is well behaved, and that this monitoring can be used to select good model checkpoints. Second, they illustrated that AFVs are a useful feature representation for linear and nearest-neighbour classification, achieving state-of-the-art results among unsupervised feature representations and being competitive with supervised results on CIFAR-10.
Finally, they showed that a well-trained GAN discriminator does contain useful information for fine-grained perceptual similarity, suggesting that AFVs are good candidates for image search. All in all, the conducted experiments show the effectiveness of EBM GANs coupled with the Fisher Information framework for extracting useful representational features from GANs.
As future work, the authors propose to improve the scalability of the AFV method by compressing the Fisher Vector representation, using methods like product quantization.
<br />
== Source Code ==
The code for this paper is freely available at [https://github.com/apple/ml-afv Adversarial Fisher Vectors].
<br />
== Critique ==

This paper makes an excellent contribution to feature representation, exploiting information theory and GANs. However, it lacks an intuitive explanation of the defined formulas and of why this representation performs well in classification tasks. An "Analysis" section would therefore help make the paper more readable and understandable.
<br />
== References==

Jaakkola, Tommi, and David Haussler. "Exploiting generative models in discriminative classifiers." Advances in Neural Information Processing Systems. 1999.

Zhai, Shuangfei, et al. "Adversarial Fisher Vectors for Unsupervised Representation Learning." Advances in Neural Information Processing Systems. 2019.

Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2007.

Sánchez, Jorge, et al. "Image classification with the Fisher vector: Theory and practice." International Journal of Computer Vision 105.3 (2013): 222-245.
<hr />
<div>== Presented by ==<br />
Sobhan Hemati<br />
<br />
== Introduction ==<br />
<br />
Generative adversarial networks (GANs) are among the most important generative models, where discriminators and generators compete with each other to solve a minimax game. Based on the original GAN paper, when the training is finished and Nash Equilibrium is reached, the discriminator is nothing but a constant function that assigns a score of 0.5 everywhere. This means that in this setting discriminator is nothing more than a tool to train the generator. Furthermore, the generator in traditional GAN models the data density in an implicit manner, while in some applications we need to have an explicit generative model of data. Recently, it has been shown that training an energy-based model (EBM) with a parameterized variational distribution is also a minimax game similar to the one in GAN. Although they are similar, an advantage of this EBM view is that unlike the original GAN formulation, the discriminator itself is an explicit density model of the data.<br />
<br />
Considering some remarks, the authors in this paper show that an energy-based model can be trained using a similar minimax formulation in GANs. After training the energy-based model, they use Fisher Score and Fisher Information (which are calculated based on derivative of the generative models w.r.t its parameters) to evaluate the power of discriminator in modeling the data distribution. More precisely, they calculate normalized Fisher Vectors and Fisher Distance measure using the derivative of the discriminator to estimate similarities both between individual data samples and between sets of samples. They name these derived representations Adversarial Fisher Vectors (AFVs). In fact, Fisher vector is a powerful representation that can be calculated using EBMs thanks to the fact that in this EBM model, the discriminator itself is an explicit density model of the data. Fisher vector can be used for setting representation problems which is a challenging task. In fact, as we will see, we can use the Fisher kernel to calculate the distance between two sets of images which is not a trivial task. The authors find several applications and attractive characteristics for AFV as pre-trained features such as:<br />
<br />
* State-of-the-art performance for unsupervised feature extraction and linear classification tasks.<br />
* Using the similarity function induced by the learned density model as a perceptual metric that correlates well with human judgments.<br />
* Improved training of GANs through monitoring (AFV metrics) and stability (MCMC updates) which is a difficult task in general.<br />
* Using AFV to estimate the distance between sets which allows monitoring the training process. More precisely, the Fisher Distance between the set of validation examples and generated examples can effectively capture the existence of overfitting.<br />
<br />
== Background == <br />
===Generative Adversarial Networks===<br />
GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated). The weights of generator and discriminator are updated by solving the following optimization problem:<br />
\begin{equation}<br />
\underset{G}{\text{max}} \ \underset{D}{\text{min}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]- E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[-\log (1-D(G(\mathbf{z})))]<br />
\tag{1}<br />
\label{1}<br />
\end{equation}<br />
<br />
Where <math> p_{data(\mathbf{x})} </math>, <math> D(x) </math>, and <math> G(x) </math> are distribution of data, discriminator, and generator respectively. To optimize the above problem, in the inner loop <math> D </math> is trained until convergence given <math> G </math>, and in the outer loop <math> G </math>, is updated one step given <math> D </math>.<br />
<br />
===GANs as variational training of deep EBMs===<br />
An energy-based model (EBM) is a form of generative model (GM) that learns the characteristics of a target dataset and generates a similar but larger dataset. EBMs detect the latent variables of a dataset and generate new datasets with a similar distribution. Let an energy-based model define a density function <math> p_{E}(\mathbf{x}) </math> as <math> \frac{e^{-E(\mathbf{x})}}{ \int_{\mathbf{x}} e^{-E(\mathbf{x})} \,d\mathbf{x} } </math>. Then, the negative log likelihood (NLL) of the <math> p_{E}(\mathbf{x}) </math> can be written as<br />
<br />
\begin{equation}<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log \int_{\mathbf{x}} q(\mathbf{x}) \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}\,d\mathbf{x} =<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log E_{\mathbf{x} \sim q(\mathbf{x})}[\frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}] \geq \\<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ E_{\mathbf{x} \sim q(\mathbf{x})}[\log \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}]= E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]- E_{\mathbf{x} \sim q(\mathbf{x})}[E(\mathbf{x})] + H(q)<br />
\tag{2}<br />
\label{2}<br />
\end{equation}<br />
<br />
where <math> q(x) </math> is an auxiliary distribution which is called the variational distribution and <math>H(q) </math> defines its entropy. Here Jensen’s inequality was used to obtain the variational lower bound on the NLL given <math>H(q) </math>. This bound is tight if <math> q(x) \propto e^{-E(\mathbf{x})} \ \forall \mathbf{x}, </math> which means <math> q(x) = p_{E}(\mathbf{x}) </math>. In this case, if we put <math> D(\mathbf{x})= -E(\mathbf{x}) </math> and also <math> q(x)= p_{G}(\mathbf{x}) </math>, Eq.\ref{2} turns to the following problem:<br />
<br />
<br />
<br />
\begin{equation}<br />
\underset{D}{\text{min}} \ \underset{G}{\text{max}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[\log (D(G(\mathbf{z})))] +H(p_{G})<br />
\tag{3}<br />
\label{3}<br />
\end{equation}<br />
<br />
<br />
where in the problem, the variational lower bound is maximized w.r.t. <math> p_{G}</math>; the energy model then is updated one step to decrease the NLL with the optimal <math> p_{G}</math> (see Figure1). [[File:Fig1.png]]<br />
<br />
Equations \ref{3} and \ref{1} are similar in the sense that both taking the form of a minimax game between <math> D </math> and <math> G </math>. However, there are 3 major differences:<br />
<br />
*The entropy regularization term <math> H(p_{G})</math> in Eq. \ref{3} prevents the generator from collapsing (although in practice, it is difficult to come up with a differentiable approximation to the entropy term <math> H(p_{G})</math> and instead heuristic regularization methods such as Batch Normalization are used).<br />
* The order of optimizing <math> D </math> and <math> G </math> is different.<br />
* More importantly, <math> D </math> is a density model for the data distribution and <math> G </math> learns to sample from <math> D </math>.<br />
<br />
== Methodology==<br />
===Adversarial Fisher Vectors===<br />
As it was mentioned, one of the most important advantages of an EBM GAN compared with traditional ones is that discriminator is a dual form of the generator. This means that the discriminator can define a distribution that matches the training data. Generally, there is a straightforward way for evaluating the quality of the generator and that is inspecting the quality of produced samples. However, when it comes to discriminator, this is not clear how to evaluate or use a discriminator trained in minimax scheme. To evaluate and also employ discriminator of the GAN, the authors in this paper propose to employ the theory of Fisher Information. This theory was proposed with the motivation of making connections between two different types of models used in machine learning i.e, generative and discriminator models. Given a density model <math> p_{\theta}(\mathbf{x})</math> where <math> \mathbf{x} \in R^d </math> and <math> \theta </math> are input and model parameters, the fisher score of an example <math> \mathbf{x} </math> is defined as <math> U_\mathbf{x}=\nabla_{\theta} \log p_{\theta}(\mathbf{x}) </math>. This gradient maps an example <math> \mathbf{x} </math> into a feature vector that is a point in the gradient space of the manifold. Intuitively, This gradient <math> U_\mathbf{x} </math> can be used to define the direction of steepest ascent in <math> \log p(\mathbf{x}|\theta) </math> for the example <math> \mathbf{x} </math> along the manifold. In other words, The Fisher<br />
Score encodes the desired change of model parameters to better fit the example <math> \mathbf{x} </math>. The authors define the Fisher Information as <math> I=E_{\mathbf{x} \sim} p_{\theta}(\mathbf{x}) [U_\mathbf{x} U_\mathbf{x}^T]</math>. Having Fisher Information and Fisher Score, one can then map an example <math> \mathbf{x} </math> from feature space to the model space, and measure the proximity between two examples <math> \mathbf{x} </math>; <math> \mathbf{y} </math> by <math> U_\mathbf{x}^T I^{-1} U_\mathbf{y}</math>. The metric distance based on this proximity is defined as <math> (U_\mathbf{x}-U_\mathbf{y})^T I^{-1} (U_\mathbf{x}-U_\mathbf{y})</math>. This metric distance is called Fisher distance and easily can be generalized to measure distance between two sets. Finally, The adversarial Fisher Distance (AFV) is defined as<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=I^{-\frac{1}{2}}U_\mathbf{x}<br />
\end{equation}<br />
<br />
As a result, Fisher Distance is equivalent to the Euclidean distance with AFVs. The fisher vector theory has been using simple generative models like gmms.<br />
In the domain of the EBMs, where the density model is parameterized as <math> p_\theta(\mathbf{x})= \frac{e^{-D(\mathbf{x},\theta)}}{\int_{\mathbf{x}} e^{-D(\mathbf{x},\theta)} \,d\mathbf{x}} </math> and <math> \theta </math> are parameters of <math> D</math>, the fisher score is derived as<br />
<br />
<br />
<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - \nabla_{\theta} \log \int_{\mathbf{x}} e^{D(\mathbf{x},\theta)} \,d\mathbf{x}= \nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{x} \sim p_\theta(\mathbf{x})} \nabla_{\theta} D(\mathbf{x};\theta).<br />
\tag{4}<br />
\label{4}<br />
\end{equation}<br />
As we know, in an EBM GAN, the generator is updated during the training to match the distribution of <math> p_G(\mathbf{x}) </math> to <math> p_\theta(\mathbf{x})</math>. This allows us to approximate the second term in Eq.\ref{4} by sampling form generator's distribution which let us to compute the Fisher Information and Fisher Score in EBM GAN as follow:<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{z} \sim p(\mathbf{z})} \nabla_{\theta} D(G(\mathbf{z});\theta), \quad I= E_{\mathbf{z} \sim p(\mathbf{z})}[U_{G(\mathbf{z})} U^T_{G(\mathbf{z})}]<br />
\tag{5}<br />
\label{5}<br />
\end{equation}<br />
<br />
Finally, having Fisher Score and Fisher Information, we use the following approximation to calculate AFV:<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=\mbox{diag}(I)^{-\frac{1}{2}}U_\mathbf{x}<br />
\tag{6}<br />
\label{6}<br />
\end{equation}<br />
<br />
Remember that by using Fisher Score, we transform data from feature space to the parameter space which means that the dimensionality of the vectors can easily be up to millions. As a result, replacing <math> I </math> with <math>\mbox{diag}(I) </math> is an attempt to reduce the computational load of calculating final AFV.<br />
<br />
===Generator update as stochastic gradient MCMC===<br />
The use of a generator provides an efficient way of drawing samples from the EBM. However, in practice, great care needs to be taken to make sure that G is well conditioned to produce examples that cover enough modes of D. There is also a related issue where the parameters of G will occasionally undergo sudden changes, generating samples drastically different from iteration to iteration, which contributes to training instability and lower model quality.<br />
<br />
In light of these issues, they provide a different treatment of G, borrowing inspirations from the Markov chain Monte Carlo (MCMC) literature. MCMC variants have been widely studied in the context of EBM's, which can be used to sample from an unnormalized density and approximate the partition function. Stochastic gradient MCMC is of particular interest as it utilizes the gradient of the log probability w.r.t. the input, and performs gradient ascent to incrementally update the samples(while adding noise to the gradients). See for a recent application of this technique to deepEBMs. We speculate that it is possible to train G to mimic the stochastic gradient MCMC update rule, such that the samples produced by G will approximate the true model distribution.<br />
<br />
== Experiments ==<br />
===Evaluating AFV representations===<br />
As it was pointed out, the main advantage of the EBM GANs is their powerful discriminator which is able to learn a density function that characterizes the data manifold of the training data. To evaluate how good the discriminator learns the data distribution, authors proposed to use Fisher Information theory. To do this, authors trained some models under different models and employed the discriminator to extract AFVs and then use these vectors for unsupervised pretraining classification task.<br />
Results in Table 1 suggest that AFVs achieve state-of-art performance in unsupervised pretraining classification task and also comparable with the supervised learning.<br />
<br />
[[File:Table1.png||center]]<br />
<br />
AFVs can also be used to measure distance between a set of data points. Authors took advantage of this point and calculate the semantic distance between classes (all data points of every class) in CIFAR 10. As can be seen from Figure 2, although the training has been unsupervised, the semantic relation between classed are well estimated. For example, in Figure 2 cars are similar to trucks, dogs are similar to cats.<br />
<br />
[[File:Sobhan_Fig2.jpg||center]]<br />
<br />
<br />
As AFVs transform data from feature space to the parameter space of the generative model and as a result carry information about the data manifold, they are also expected to carry additional fine-grained perceptual information. To evaluate this, authors ran experiments to examine the usefulness of AFVs as a perceptual similarity metric consistent with human judgments. They use the AFV representation to calculate distances between image patches and compare with current methods on the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset on 2AFC and Just Noticeable Difference (JND) metrics. They trained a GAN on ImageNet and then calculate AFVs on the BAPPS evaluation set.<br />
Table 2 shows the performance of AFV along with a variety of existing benchmarks. Clearly, AFV exceeds the reported unsupervised and self-supervised methods and is competitive with supervised methods trained on ImageNet.<br />
<br />
[[File:Sobhan_Table2.png||center]]<br />
<br />
An interesting point about AFVs is their robustness to overfitting. AFVs are 3 orders of magnitude higher than those of the existing methods, which would typically bring a higher propensity to overfitting. However, AFVs still show great generalization ability, demonstrating that they are indeed encoding a meaningful low dimensional subspace of original data. Figure 6 visualizes the nearest neighbours.<br />
<br />
[[File:Sobhan_Fig_6.png||center]]<br />
<br />
===Using the Fisher Distance to monitor training===<br />
Training GANs has been a challenging task which is partly because of the lack of reliable metrics. Although recently some domain specific metrics such as Inception Scores and Fréchet Inception Distance have been proposed, they are mainly relied on a discriminative model trained on ImageNet, and thus have limited<br />
applicability to datasets that are drastically different. In this paper, authors the Fisher Distance between the set of real and generated examples to monitor and diagnose the training process. To do this, conducted a set of experiments on CIFAR10 by varying the number of training examples from the set {1000; 5000; 25000; 50000}. Figure 3 shows batch-wise estimate of Inception Score and the "Fisher Similarity". This is clear that for higher number of training examples, the validation Fisher Similarity steadily increases, in the similar trend to the Inception Score. On the other hand, when the number of training examples is small, the validation Fisher Similarity starts decreasing at some point.<br />
<br />
[[File:Sobhan_Fig_3.png||center]]<br />
<br />
<br />
===Interpreting G update as parameterized MCMC===<br />
AFC can only be applied if a generator approximates EBM during the training process. Model is trained on Imagenet with 64X64 along with modification of default architecture with the addition of residual blocks to discriminator and generator. Following figure shows training stats over 80,000 iterations.<br />
<br />
[[File:training 80K.png|600px|center]]<br />
<div align="center">Left: default generator objective. Right: corresponding Inception scores.</div><br />
<br />
== Conclusion ==<br />
In this paper, the authors demonstrated that GANs can be reinterpreted in order to learn representations across a diverse set of tasks without requiring domain knowledge or annotated data. Authors also showed that in an EBM GAN, discriminator can explicitly learn data distribution and capture the intrinsic manifold of data with low error rate. This is especially different from regular GANs where the discriminator is reduced to a constant function once the Nash Equilibrium is reached. To evaluate how well the discriminator estimates data distribution, the authors took advantage of Fisher Information theory. First, they showed that AFVs are a reliable indicator of whether GAN<br />
training is well behaved, and that we can use this monitoring to select good model checkpoints. Second, they illustrated that AFVs are a useful feature representation for linear and nearest neighbour classification, achieving state-of-the-art among unsupervised feature representations and competitive with supervised results on CIFAR-10. <br />
Finally, they showed that a well-trained GAN discriminator does contain useful information for fine-grained perceptual similarity suggesting that AFVs are good candidates for image search. All in all, the conducted experiments show the effectiveness of the EBM GANs coupled with the Fisher Information framework for extracting useful representational features from GANs. <br />
As future work, authors propose to improve the scalability of the AFV method by compressing the Fisher Vector representation, using methods like product quantization.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/apple/ml-afv link Adversarial Fisher Vectors].<br />
<br />
== Critique == <br />
<br />
This paper is has an excellent contribution in feature representation exploiting information theory and GANs. Although it lacked intuitive explanation of the formula it defined and how this representation is performing well in classification tasks. Therefore, an "Analysis" section would help the paper to be more readable and understandable.<br />
<br />
== References==<br />
<br />
Jaakkola, Tommi, and David Haussler. "Exploiting generative models in discriminative classifiers." Advances in neural information processing systems. 1999.<br />
<br />
Zhai, Shuangfei, et al. "Adversarial Fisher Vectors for Unsupervised Representation Learning." Advances in Neural Information Processing Systems. 2019.<br />
<br />
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." 2007 IEEE conference on computer vision and pattern recognition. IEEE, 2007.<br />
<br />
Sánchez, Jorge, et al. "Image classification with the fisher vector: Theory and practice." International journal of computer vision 105.3 (2013): 222-245.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_Learning&diff=48843Adversarial Fisher Vectors for Unsupervised Representation Learning2020-12-02T05:40:28Z<p>Mrasooli: /* Experiments */</p>
<hr />
<div>== Presented by ==<br />
Sobhan Hemati<br />
<br />
== Introduction ==<br />
<br />
Generative adversarial networks (GANs) are among the most important generative models, where discriminators and generators compete with each other to solve a minimax game. Based on the original GAN paper, when the training is finished and Nash Equilibrium is reached, the discriminator is nothing but a constant function that assigns a score of 0.5 everywhere. This means that in this setting discriminator is nothing more than a tool to train the generator. Furthermore, the generator in traditional GAN models the data density in an implicit manner, while in some applications we need to have an explicit generative model of data. Recently, it has been shown that training an energy-based model (EBM) with a parameterized variational distribution is also a minimax game similar to the one in GAN. Although they are similar, an advantage of this EBM view is that unlike the original GAN formulation, the discriminator itself is an explicit density model of the data.<br />
<br />
Considering some remarks, the authors in this paper show that an energy-based model can be trained using a similar minimax formulation in GANs. After training the energy-based model, they use Fisher Score and Fisher Information (which are calculated based on derivative of the generative models w.r.t its parameters) to evaluate the power of discriminator in modeling the data distribution. More precisely, they calculate normalized Fisher Vectors and Fisher Distance measure using the derivative of the discriminator to estimate similarities both between individual data samples and between sets of samples. They name these derived representations Adversarial Fisher Vectors (AFVs). In fact, Fisher vector is a powerful representation that can be calculated using EBMs thanks to the fact that in this EBM model, the discriminator itself is an explicit density model of the data. Fisher vector can be used for setting representation problems which is a challenging task. In fact, as we will see, we can use the Fisher kernel to calculate the distance between two sets of images which is not a trivial task. The authors find several applications and attractive characteristics for AFV as pre-trained features such as:<br />
<br />
* State-of-the-art performance for unsupervised feature extraction and linear classification tasks.<br />
* Using the similarity function induced by the learned density model as a perceptual metric that correlates well with human judgments.<br />
* Improved training of GANs through monitoring (AFV metrics) and stability (MCMC updates) which is a difficult task in general.<br />
* Using AFV to estimate the distance between sets which allows monitoring the training process. More precisely, the Fisher Distance between the set of validation examples and generated examples can effectively capture the existence of overfitting.<br />
<br />
== Background == <br />
===Generative Adversarial Networks===<br />
GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated). The weights of generator and discriminator are updated by solving the following optimization problem:<br />
\begin{equation}<br />
\underset{G}{\text{max}} \ \underset{D}{\text{min}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]- E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[-\log (1-D(G(\mathbf{z})))]<br />
\tag{1}<br />
\label{1}<br />
\end{equation}<br />
<br />
Where <math> p_{data(\mathbf{x})} </math>, <math> D(x) </math>, and <math> G(x) </math> are distribution of data, discriminator, and generator respectively. To optimize the above problem, in the inner loop <math> D </math> is trained until convergence given <math> G </math>, and in the outer loop <math> G </math>, is updated one step given <math> D </math>.<br />
<br />
===GANs as variational training of deep EBMs===<br />
An energy-based model (EBM) is a form of generative model (GM) that learns the characteristics of a target dataset and generates a similar but larger dataset. EBMs detect the latent variables of a dataset and generate new datasets with a similar distribution. Let an energy-based model define a density function <math> p_{E}(\mathbf{x}) </math> as <math> \frac{e^{-E(\mathbf{x})}}{ \int_{\mathbf{x}} e^{-E(\mathbf{x})} \,d\mathbf{x} } </math>. Then, the negative log likelihood (NLL) of the <math> p_{E}(\mathbf{x}) </math> can be written as<br />
<br />
\begin{equation}<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log \int_{\mathbf{x}} q(\mathbf{x}) \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}\,d\mathbf{x} =<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log E_{\mathbf{x} \sim q(\mathbf{x})}[\frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}] \geq \\<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ E_{\mathbf{x} \sim q(\mathbf{x})}[\log \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}]= E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]- E_{\mathbf{x} \sim q(\mathbf{x})}[E(\mathbf{x})] + H(q)<br />
\tag{2}<br />
\label{2}<br />
\end{equation}<br />
<br />
where <math> q(x) </math> is an auxiliary distribution which is called the variational distribution and <math>H(q) </math> defines its entropy. Here Jensen’s inequality was used to obtain the variational lower bound on the NLL given <math>H(q) </math>. This bound is tight if <math> q(x) \propto e^{-E(\mathbf{x})} \ \forall \mathbf{x}, </math> which means <math> q(x) = p_{E}(\mathbf{x}) </math>. In this case, if we put <math> D(\mathbf{x})= -E(\mathbf{x}) </math> and also <math> q(x)= p_{G}(\mathbf{x}) </math>, Eq.\ref{2} turns to the following problem:<br />
<br />
<br />
<br />
\begin{equation}<br />
\underset{D}{\text{min}} \ \underset{G}{\text{max}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[\log (D(G(\mathbf{z})))] +H(p_{G})<br />
\tag{3}<br />
\label{3}<br />
\end{equation}<br />
<br />
<br />
where in the problem, the variational lower bound is maximized w.r.t. <math> p_{G}</math>; the energy model then is updated one step to decrease the NLL with the optimal <math> p_{G}</math> (see Figure1). [[File:Fig1.png]]<br />
<br />
Equations \ref{3} and \ref{1} are similar in the sense that both taking the form of a minimax game between <math> D </math> and <math> G </math>. However, there are 3 major differences:<br />
<br />
*The entropy regularization term <math> H(p_{G})</math> in Eq. \ref{3} prevents the generator from collapsing (although in practice, it is difficult to come up with a differentiable approximation to the entropy term <math> H(p_{G})</math> and instead heuristic regularization methods such as Batch Normalization are used).<br />
* The order of optimizing <math> D </math> and <math> G </math> is different.<br />
* More importantly, <math> D </math> is a density model for the data distribution and <math> G </math> learns to sample from <math> D </math>.<br />
<br />
== Methodology==<br />
===Adversarial Fisher Vectors===<br />
As it was mentioned, one of the most important advantages of an EBM GAN compared with traditional ones is that discriminator is a dual form of the generator. This means that the discriminator can define a distribution that matches the training data. Generally, there is a straightforward way for evaluating the quality of the generator and that is inspecting the quality of produced samples. However, when it comes to discriminator, this is not clear how to evaluate or use a discriminator trained in minimax scheme. To evaluate and also employ discriminator of the GAN, the authors in this paper propose to employ the theory of Fisher Information. This theory was proposed with the motivation of making connections between two different types of models used in machine learning i.e, generative and discriminator models. Given a density model <math> p_{\theta}(\mathbf{x})</math> where <math> \mathbf{x} \in R^d </math> and <math> \theta </math> are input and model parameters, the fisher score of an example <math> \mathbf{x} </math> is defined as <math> U_\mathbf{x}=\nabla_{\theta} \log p_{\theta}(\mathbf{x}) </math>. This gradient maps an example <math> \mathbf{x} </math> into a feature vector that is a point in the gradient space of the manifold. Intuitively, This gradient <math> U_\mathbf{x} </math> can be used to define the direction of steepest ascent in <math> \log p(\mathbf{x}|\theta) </math> for the example <math> \mathbf{x} </math> along the manifold. In other words, The Fisher<br />
Score encodes the desired change of model parameters to better fit the example <math> \mathbf{x} </math>. The authors define the Fisher Information as <math> I=E_{\mathbf{x} \sim} p_{\theta}(\mathbf{x}) [U_\mathbf{x} U_\mathbf{x}^T]</math>. Having Fisher Information and Fisher Score, one can then map an example <math> \mathbf{x} </math> from feature space to the model space, and measure the proximity between two examples <math> \mathbf{x} </math>; <math> \mathbf{y} </math> by <math> U_\mathbf{x}^T I^{-1} U_\mathbf{y}</math>. The metric distance based on this proximity is defined as <math> (U_\mathbf{x}-U_\mathbf{y})^T I^{-1} (U_\mathbf{x}-U_\mathbf{y})</math>. This metric distance is called Fisher distance and easily can be generalized to measure distance between two sets. Finally, The adversarial Fisher Distance (AFV) is defined as<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=I^{-\frac{1}{2}}U_\mathbf{x}<br />
\end{equation}<br />
<br />
As a result, Fisher Distance is equivalent to the Euclidean distance with AFVs. The fisher vector theory has been using simple generative models like gmms.<br />
In the domain of the EBMs, where the density model is parameterized as <math> p_\theta(\mathbf{x})= \frac{e^{-D(\mathbf{x},\theta)}}{\int_{\mathbf{x}} e^{-D(\mathbf{x},\theta)} \,d\mathbf{x}} </math> and <math> \theta </math> are parameters of <math> D</math>, the fisher score is derived as<br />
<br />
<br />
<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - \nabla_{\theta} \log \int_{\mathbf{x}} e^{D(\mathbf{x},\theta)} \,d\mathbf{x}= \nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{x} \sim p_\theta(\mathbf{x})} \nabla_{\theta} D(\mathbf{x};\theta).<br />
\tag{4}<br />
\label{4}<br />
\end{equation}<br />
As we know, in an EBM GAN, the generator is updated during the training to match the distribution of <math> p_G(\mathbf{x}) </math> to <math> p_\theta(\mathbf{x})</math>. This allows us to approximate the second term in Eq.\ref{4} by sampling form generator's distribution which let us to compute the Fisher Information and Fisher Score in EBM GAN as follow:<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{z} \sim p(\mathbf{z})} \nabla_{\theta} D(G(\mathbf{z});\theta), \quad I= E_{\mathbf{z} \sim p(\mathbf{z})}[U_{G(\mathbf{z})} U^T_{G(\mathbf{z})}]<br />
\tag{5}<br />
\label{5}<br />
\end{equation}<br />
<br />
Finally, having Fisher Score and Fisher Information, we use the following approximation to calculate AFV:<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=\mbox{diag}(I)^{-\frac{1}{2}}U_\mathbf{x}<br />
\tag{6}<br />
\label{6}<br />
\end{equation}<br />
<br />
Remember that by using Fisher Score, we transform data from feature space to the parameter space which means that the dimensionality of the vectors can easily be up to millions. As a result, replacing <math> I </math> with <math>\mbox{diag}(I) </math> is an attempt to reduce the computational load of calculating final AFV.<br />
<br />
===Generator update as stochastic gradient MCMC===<br />
The use of a generator provides an efficient way of drawing samples from the EBM. However, in practice, great care needs to be taken to make sure that G is well conditioned to produce examples that cover enough modes of D. There is also a related issue where the parameters of G will occasionally undergo sudden changes, generating samples drastically different from iteration to iteration, which contributes to training instability and lower model quality.<br />
<br />
In light of these issues, they provide a different treatment of G, borrowing inspirations from the Markov chain Monte Carlo (MCMC) literature. MCMC variants have been widely studied in the context of EBM's, which can be used to sample from an unnormalized density and approximate the partition function. Stochastic gradient MCMC is of particular interest as it utilizes the gradient of the log probability w.r.t. the input, and performs gradient ascent to incrementally update the samples(while adding noise to the gradients). See for a recent application of this technique to deepEBMs. We speculate that it is possible to train G to mimic the stochastic gradient MCMC update rule, such that the samples produced by G will approximate the true model distribution.<br />
<br />
== Experiments ==<br />
===Evaluating AFV representations===<br />
As it was pointed out, the main advantage of the EBM GANs is their powerful discriminator which is able to learn a density function that characterizes the data manifold of the training data. To evaluate how good the discriminator learns the data distribution, authors proposed to use Fisher Information theory. To do this, authors trained some models under different models and employed the discriminator to extract AFVs and then use these vectors for unsupervised pretraining classification task.<br />
Results in Table 1 suggest that AFVs achieve state-of-art performance in unsupervised pretraining classification task and also comparable with the supervised learning.<br />
<br />
[[File:Table1.png||center]]<br />
<br />
AFVs can also be used to measure distance between a set of data points. Authors took advantage of this point and calculate the semantic distance between classes (all data points of every class) in CIFAR 10. As can be seen from Figure 2, although the training has been unsupervised, the semantic relation between classed are well estimated. For example, in Figure 2 cars are similar to trucks, dogs are similar to cats.<br />
<br />
[[File:Sobhan_Fig2.jpg||center]]<br />
<br />
<br />
As AFVs transform data from feature space to the parameter space of the generative model and as a result carry information about the data manifold, they are also expected to carry additional fine-grained perceptual information. To evaluate this, authors ran experiments to examine the usefulness of AFVs as a perceptual similarity metric consistent with human judgments. They use the AFV representation to calculate distances between image patches and compare with current methods on the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset on 2AFC and Just Noticeable Difference (JND) metrics. They trained a GAN on ImageNet and then calculate AFVs on the BAPPS evaluation set.<br />
Table 2 shows the performance of AFV along with a variety of existing benchmarks. Clearly, AFV exceeds the reported unsupervised and self-supervised methods and is competitive with supervised methods trained on ImageNet.<br />
<br />
[[File:Sobhan_Table2.png]]<br />
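<br />
As a rough illustration of how AFVs are used on a 2AFC triplet (the function and its inputs are assumptions of this summary, not the evaluation code): the candidate patch whose AFV is closer to the reference is predicted as the more similar one, and agreement with the human choice gives the 2AFC score.<br />
<pre>
# 2AFC scoring with AFV distances (illustrative).
import numpy as np

def two_afc_correct(afv_ref, afv_0, afv_1, human_choice):
    d0 = np.linalg.norm(afv_ref - afv_0)
    d1 = np.linalg.norm(afv_ref - afv_1)
    predicted = 0 if d0 < d1 else 1
    return float(predicted == human_choice)   # averaged over triplets -> 2AFC score
</pre>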
<br />
An interesting point about AFVs is their robustness to overfitting. The dimensionality of AFVs is three orders of magnitude higher than that of the representations used by existing methods, which would typically bring a higher propensity to overfit. However, AFVs still show strong generalization ability, demonstrating that they indeed encode a meaningful low-dimensional subspace of the original data. Figure 6 visualizes the nearest neighbours.<br />
<br />
[[File:Sobhan_Fig_6.png]]<br />
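<br />
The nearest-neighbour visualization in Figure 6 corresponds to ranking a gallery of AFVs by Euclidean distance to a query AFV, roughly as in the sketch below (names and the choice of plain Euclidean distance are illustrative).<br />
<pre>
# Nearest-neighbour retrieval in AFV space (illustrative).
import numpy as np

def nearest_neighbours(query_afv, gallery_afvs, k=5):
    dists = np.linalg.norm(gallery_afvs - query_afv, axis=1)
    return np.argsort(dists)[:k]              # indices of the k closest gallery items
</pre>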
<br />
===Using the Fisher Distance to monitor training===<br />
Training GANs has been a challenging task, partly because of the lack of reliable metrics. Although some domain-specific metrics such as the Inception Score and the Fréchet Inception Distance have recently been proposed, they mainly rely on a discriminative model trained on ImageNet and thus have limited applicability to datasets that are drastically different. In this paper, the authors use the Fisher Distance between the set of real and generated examples to monitor and diagnose the training process. To do this, they conducted a set of experiments on CIFAR-10, varying the number of training examples over the set {1000, 5000, 25000, 50000}. Figure 3 shows the batch-wise estimate of the Inception Score and the "Fisher Similarity". It is clear that for a larger number of training examples, the validation Fisher Similarity steadily increases, following a trend similar to the Inception Score. On the other hand, when the number of training examples is small, the validation Fisher Similarity starts decreasing at some point.<br />
<br />
[[File:Sobhan_Fig_3.png||center]]<br />
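<br />
As an illustration of how such a monitoring signal could be logged during training, the sketch below tracks a cosine similarity between the mean AFVs of a validation batch and a generated batch; this is a stand-in for the paper's Fisher Similarity, whose exact normalisation may differ.<br />
<pre>
# Monitoring sketch: similarity between validation and generated samples in AFV
# space, logged over training; a sustained drop suggests overfitting.
import numpy as np

def fisher_similarity(afv_valid, afv_generated):
    a = afv_valid.mean(axis=0)
    b = afv_generated.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
</pre>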
<br />
<br />
===Interpreting G update as parameterized MCMC===<br />
AFV can only be applied if the generator approximates the EBM during the training process. The model is trained on ImageNet at 64×64 resolution, with the default architecture modified by adding residual blocks to the discriminator and generator. The following figure shows training statistics over 80,000 iterations.<br />
<br />
[[File:training 80K.png|600px|center]]<br />
<div align="center">Left: default generator objective. Right: corresponding Inception scores.</div><br />
<br />
== Conclusion ==<br />
In this paper, the authors demonstrated that GANs can be reinterpreted in order to learn representations across a diverse set of tasks without requiring domain knowledge or annotated data. The authors also showed that in an EBM GAN, the discriminator can explicitly learn the data distribution and capture the intrinsic manifold of the data with low error. This is especially different from regular GANs, where the discriminator is reduced to a constant function once the Nash Equilibrium is reached. To evaluate how well the discriminator estimates the data distribution, the authors took advantage of Fisher Information theory. First, they showed that AFVs are a reliable indicator of whether GAN training is well behaved, and that this monitoring can be used to select good model checkpoints. Second, they illustrated that AFVs are a useful feature representation for linear and nearest-neighbour classification, achieving state-of-the-art results among unsupervised feature representations and results competitive with supervised ones on CIFAR-10. <br />
Finally, they showed that a well-trained GAN discriminator does contain useful information for fine-grained perceptual similarity, suggesting that AFVs are good candidates for image search. All in all, the conducted experiments show the effectiveness of EBM GANs coupled with the Fisher Information framework for extracting useful representational features from GANs. <br />
As future work, the authors propose to improve the scalability of the AFV method by compressing the Fisher Vector representation, using methods such as product quantization.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/apple/ml-afv Adversarial Fisher Vectors].<br />
<br />
== Critique == <br />
<br />
This paper makes an excellent contribution to feature representation by exploiting information theory and GANs. However, it lacks an intuitive explanation of the formulas it defines and of why this representation performs so well in classification tasks. An "Analysis" section would therefore make the paper more readable and understandable.<br />
<br />
== References==<br />
<br />
Jaakkola, Tommi, and David Haussler. "Exploiting generative models in discriminative classifiers." Advances in neural information processing systems. 1999.<br />
<br />
Zhai, Shuangfei, et al. "Adversarial Fisher Vectors for Unsupervised Representation Learning." Advances in Neural Information Processing Systems. 2019.<br />
<br />
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." 2007 IEEE conference on computer vision and pattern recognition. IEEE, 2007.<br />
<br />
Sánchez, Jorge, et al. "Image classification with the Fisher vector: Theory and practice." International journal of computer vision 105.3 (2013): 222-245.</div>
<hr />
<div>== Presented by ==<br />
Sobhan Hemati<br />
<br />
== Introduction ==<br />
<br />
Generative adversarial networks (GANs) are among the most important generative models, where discriminators and generators compete with each other to solve a minimax game. Based on the original GAN paper, when the training is finished and Nash Equilibrium is reached, the discriminator is nothing but a constant function that assigns a score of 0.5 everywhere. This means that in this setting discriminator is nothing more than a tool to train the generator. Furthermore, the generator in traditional GAN models the data density in an implicit manner, while in some applications we need to have an explicit generative model of data. Recently, it has been shown that training an energy-based model (EBM) with a parameterized variational distribution is also a minimax game similar to the one in GAN. Although they are similar, an advantage of this EBM view is that unlike the original GAN formulation, the discriminator itself is an explicit density model of the data.<br />
<br />
Considering some remarks, the authors in this paper show that an energy-based model can be trained using a similar minimax formulation in GANs. After training the energy-based model, they use Fisher Score and Fisher Information (which are calculated based on derivative of the generative models w.r.t its parameters) to evaluate the power of discriminator in modeling the data distribution. More precisely, they calculate normalized Fisher Vectors and Fisher Distance measure using the derivative of the discriminator to estimate similarities both between individual data samples and between sets of samples. They name these derived representations Adversarial Fisher Vectors (AFVs). In fact, Fisher vector is a powerful representation that can be calculated using EBMs thanks to the fact that in this EBM model, the discriminator itself is an explicit density model of the data. Fisher vector can be used for setting representation problems which is a challenging task. In fact, as we will see, we can use the Fisher kernel to calculate the distance between two sets of images which is not a trivial task. The authors find several applications and attractive characteristics for AFV as pre-trained features such as:<br />
<br />
* State-of-the-art performance for unsupervised feature extraction and linear classification tasks.<br />
* Using the similarity function induced by the learned density model as a perceptual metric that correlates well with human judgments.<br />
* Improved training of GANs through monitoring (AFV metrics) and stability (MCMC updates) which is a difficult task in general.<br />
* Using AFV to estimate the distance between sets which allows monitoring the training process. More precisely, the Fisher Distance between the set of validation examples and generated examples can effectively capture the existence of overfitting.<br />
<br />
== Background == <br />
===Generative Adversarial Networks===<br />
GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated). The weights of generator and discriminator are updated by solving the following optimization problem:<br />
\begin{equation}<br />
\underset{G}{\text{max}} \ \underset{D}{\text{min}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]- E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[-\log (1-D(G(\mathbf{z})))]<br />
\tag{1}<br />
\label{1}<br />
\end{equation}<br />
<br />
Where <math> p_{data(\mathbf{x})} </math>, <math> D(x) </math>, and <math> G(x) </math> are distribution of data, discriminator, and generator respectively. To optimize the above problem, in the inner loop <math> D </math> is trained until convergence given <math> G </math>, and in the outer loop <math> G </math>, is updated one step given <math> D </math>.<br />
<br />
===GANs as variational training of deep EBMs===<br />
An energy-based model (EBM) is a form of generative model (GM) that learns the characteristics of a target dataset and generates a similar but larger dataset. EBMs detect the latent variables of a dataset and generate new datasets with a similar distribution. Let an energy-based model define a density function <math> p_{E}(\mathbf{x}) </math> as <math> \frac{e^{-E(\mathbf{x})}}{ \int_{\mathbf{x}} e^{-E(\mathbf{x})} \,d\mathbf{x} } </math>. Then, the negative log likelihood (NLL) of the <math> p_{E}(\mathbf{x}) </math> can be written as<br />
<br />
\begin{equation}<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log \int_{\mathbf{x}} q(\mathbf{x}) \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}\,d\mathbf{x} =<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log E_{\mathbf{x} \sim q(\mathbf{x})}[\frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}] \geq \\<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ E_{\mathbf{x} \sim q(\mathbf{x})}[\log \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}]= E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]- E_{\mathbf{x} \sim q(\mathbf{x})}[E(\mathbf{x})] + H(q)<br />
\tag{2}<br />
\label{2}<br />
\end{equation}<br />
<br />
where <math> q(x) </math> is an auxiliary distribution which is called the variational distribution and <math>H(q) </math> defines its entropy. Here Jensen’s inequality was used to obtain the variational lower bound on the NLL given <math>H(q) </math>. This bound is tight if <math> q(x) \propto e^{-E(\mathbf{x})} \ \forall \mathbf{x}, </math> which means <math> q(x) = p_{E}(\mathbf{x}) </math>. In this case, if we put <math> D(\mathbf{x})= -E(\mathbf{x}) </math> and also <math> q(x)= p_{G}(\mathbf{x}) </math>, Eq.\ref{2} turns to the following problem:<br />
<br />
<br />
<br />
\begin{equation}<br />
\underset{D}{\text{min}} \ \underset{G}{\text{max}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[\log (D(G(\mathbf{z})))] +H(p_{G})<br />
\tag{3}<br />
\label{3}<br />
\end{equation}<br />
<br />
<br />
where in the problem, the variational lower bound is maximized w.r.t. <math> p_{G}</math>; the energy model then is updated one step to decrease the NLL with the optimal <math> p_{G}</math> (see Figure1). [[File:Fig1.png]]<br />
<br />
Equations \ref{3} and \ref{1} are similar in the sense that both taking the form of a minimax game between <math> D </math> and <math> G </math>. However, there are 3 major differences:<br />
<br />
*The entropy regularization term <math> H(p_{G})</math> in Eq. \ref{3} prevents the generator from collapsing (although in practice, it is difficult to come up with a differentiable approximation to the entropy term <math> H(p_{G})</math> and instead heuristic regularization methods such as Batch Normalization are used).<br />
* The order of optimizing <math> D </math> and <math> G </math> is different.<br />
* More importantly, <math> D </math> is a density model for the data distribution and <math> G </math> learns to sample from <math> D </math>.<br />
<br />
== Methodology==<br />
===Adversarial Fisher Vectors===<br />
As it was mentioned, one of the most important advantages of an EBM GAN compared with traditional ones is that discriminator is a dual form of the generator. This means that the discriminator can define a distribution that matches the training data. Generally, there is a straightforward way for evaluating the quality of the generator and that is inspecting the quality of produced samples. However, when it comes to discriminator, this is not clear how to evaluate or use a discriminator trained in minimax scheme. To evaluate and also employ discriminator of the GAN, the authors in this paper propose to employ the theory of Fisher Information. This theory was proposed with the motivation of making connections between two different types of models used in machine learning i.e, generative and discriminator models. Given a density model <math> p_{\theta}(\mathbf{x})</math> where <math> \mathbf{x} \in R^d </math> and <math> \theta </math> are input and model parameters, the fisher score of an example <math> \mathbf{x} </math> is defined as <math> U_\mathbf{x}=\nabla_{\theta} \log p_{\theta}(\mathbf{x}) </math>. This gradient maps an example <math> \mathbf{x} </math> into a feature vector that is a point in the gradient space of the manifold. Intuitively, This gradient <math> U_\mathbf{x} </math> can be used to define the direction of steepest ascent in <math> \log p(\mathbf{x}|\theta) </math> for the example <math> \mathbf{x} </math> along the manifold. In other words, The Fisher<br />
Score encodes the desired change of model parameters to better fit the example <math> \mathbf{x} </math>. The authors define the Fisher Information as <math> I=E_{\mathbf{x} \sim} p_{\theta}(\mathbf{x}) [U_\mathbf{x} U_\mathbf{x}^T]</math>. Having Fisher Information and Fisher Score, one can then map an example <math> \mathbf{x} </math> from feature space to the model space, and measure the proximity between two examples <math> \mathbf{x} </math>; <math> \mathbf{y} </math> by <math> U_\mathbf{x}^T I^{-1} U_\mathbf{y}</math>. The metric distance based on this proximity is defined as <math> (U_\mathbf{x}-U_\mathbf{y})^T I^{-1} (U_\mathbf{x}-U_\mathbf{y})</math>. This metric distance is called Fisher distance and easily can be generalized to measure distance between two sets. Finally, The adversarial Fisher Distance (AFV) is defined as<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=I^{-\frac{1}{2}}U_\mathbf{x}<br />
\end{equation}<br />
<br />
As a result, Fisher Distance is equivalent to the Euclidean distance with AFVs. The fisher vector theory has been using simple generative models like gmms.<br />
In the domain of the EBMs, where the density model is parameterized as <math> p_\theta(\mathbf{x})= \frac{e^{-D(\mathbf{x},\theta)}}{\int_{\mathbf{x}} e^{-D(\mathbf{x},\theta)} \,d\mathbf{x}} </math> and <math> \theta </math> are parameters of <math> D</math>, the fisher score is derived as<br />
<br />
<br />
<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - \nabla_{\theta} \log \int_{\mathbf{x}} e^{D(\mathbf{x},\theta)} \,d\mathbf{x}= \nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{x} \sim p_\theta(\mathbf{x})} \nabla_{\theta} D(\mathbf{x};\theta).<br />
\tag{4}<br />
\label{4}<br />
\end{equation}<br />
As we know, in an EBM GAN, the generator is updated during the training to match the distribution of <math> p_G(\mathbf{x}) </math> to <math> p_\theta(\mathbf{x})</math>. This allows us to approximate the second term in Eq.\ref{4} by sampling form generator's distribution which let us to compute the Fisher Information and Fisher Score in EBM GAN as follow:<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{z} \sim p(\mathbf{z})} \nabla_{\theta} D(G(\mathbf{z});\theta), \quad I= E_{\mathbf{z} \sim p(\mathbf{z})}[U_{G(\mathbf{z})} U^T_{G(\mathbf{z})}]<br />
\tag{5}<br />
\label{5}<br />
\end{equation}<br />
<br />
Finally, having Fisher Score and Fisher Information, we use the following approximation to calculate AFV:<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=\mbox{diag}(I)^{-\frac{1}{2}}U_\mathbf{x}<br />
\tag{6}<br />
\label{6}<br />
\end{equation}<br />
<br />
Remember that by using Fisher Score, we transform data from feature space to the parameter space which means that the dimensionality of the vectors can easily be up to millions. As a result, replacing <math> I </math> with <math>\mbox{diag}(I) </math> is an attempt to reduce the computational load of calculating final AFV.<br />
<br />
===Generator update as stochastic gradient MCMC===<br />
The use of a generator provides an efficient way of drawing samples from the EBM. However, in practice, great care needs to be taken to make sure that G is well conditioned to produce examples that cover enough modes of D. There is also a related issue where the parameters of G will occasionally undergo sudden changes, generating samples drastically different from iteration to iteration, which contributes to training instability and lower model quality.<br />
<br />
In light of these issues, they provide a different treatment of G, borrowing inspirations from the Markov chain Monte Carlo (MCMC) literature. MCMC variants have been widely studied in the context of EBM's, which can be used to sample from an unnormalized density and approximate the partition function. Stochastic gradient MCMC is of particular interest as it utilizes the gradient of the log probability w.r.t. the input, and performs gradient ascent to incrementally update the samples(while adding noise to the gradients). See for a recent application of this technique to deepEBMs. We speculate that it is possible to train G to mimic the stochastic gradient MCMC update rule, such that the samples produced by G will approximate the true model distribution.<br />
<br />
== Experiments ==<br />
===Evaluating AFV representations===<br />
As it was pointed out, the main advantage of the EBM GANs is their powerful discriminator which is able to learn a density function that characterizes the data manifold of the training data. To evaluate how good the discriminator learns the data distribution, authors proposed to use Fisher Information theory. To do this, authors trained some models under different models and employed the discriminator to extract AFVs and then use these vectors for unsupervised pretraining classification task.<br />
Results in Table 1 suggest that AFVs achieve state-of-art performance in unsupervised pretraining classification task and also comparable with the supervised learning.<br />
<br />
[[File:Table1.png||center]]<br />
<br />
AFVs can also be used to measure distance between a set of data points. Authors took advantage of this point and calculate the semantic distance between classes (all data points of every class) in CIFAR 10. As can be seen from Figure 2, although the training has been unsupervised, the semantic relation between classed are well estimated. For example, in Figure 2 cars are similar to trucks, dogs are similar to cats.<br />
<br />
[[File:Sobhan_Fig2.jpg||center]]<br />
<br />
<br />
As AFVs transform data from feature space to the parameter space of the generative model and as a result carry information about the data manifold, they are also expected to carry additional fine-grained perceptual information. To evaluate this, authors ran experiments to examine the usefulness of AFVs as a perceptual similarity metric consistent with human judgments. They use the AFV representation to calculate distances between image patches and compare with current methods on the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset on 2AFC and Just Noticeable Difference (JND) metrics. They trained a GAN on ImageNet and then calculate AFVs on the BAPPS evaluation set.<br />
Table 2 shows the performance of AFV along with a variety of existing benchmarks. Clearly, AFV exceeds the reported unsupervised and self-supervised methods and is competitive with supervised methods trained on ImageNet.<br />
<br />
[[File:Sobhan_Table2.png]]<br />
<br />
An interesting point about AFVs is their robustness to overfitting. AFVs are 3 orders of magnitude higher than those of the existing methods, which would typically bring a higher propensity to overfitting. However, AFVs still show great generalization ability, demonstrating that they are indeed encoding a meaningful low dimensional subspace of original data. Figure 6 visualizes the nearest neighbours.<br />
<br />
[[File:Sobhan_Fig_6.png]]<br />
<br />
===Using the Fisher Distance to monitor training===<br />
Training GANs has been a challenging task which is partly because of the lack of reliable metrics. Although recently some domain specific metrics such as Inception Scores and Fréchet Inception Distance have been proposed, they are mainly relied on a discriminative model trained on ImageNet, and thus have limited<br />
applicability to datasets that are drastically different. In this paper, authors the Fisher Distance between the set of real and generated examples to monitor and diagnose the training process. To do this, conducted a set of experiments on CIFAR10 by varying the number of training examples from the set {1000; 5000; 25000; 50000}. Figure 3 shows batch-wise estimate of Inception Score and the "Fisher Similarity". This is clear that for higher number of training examples, the validation Fisher Similarity steadily increases, in the similar trend to the Inception Score. On the other hand, when the number of training examples is small, the validation Fisher Similarity starts decreasing at some point.<br />
<br />
[[File:Sobhan_Fig_3.png]]<br />
<br />
<br />
===Interpreting G update as parameterized MCMC===<br />
AFC can only be applied if a generator approximates EBM during the training process. Model is trained on Imagenet with 64X64 along with modification of default architecture with the addition of residual blocks to discriminator and generator. Following figure shows training stats over 80,000 iterations.<br />
<br />
[[File:training 80K.png|600px|center]]<br />
<div align="center">Left: default generator objective. Right: corresponding Inception scores.</div><br />
<br />
== Conclusion ==<br />
In this paper, the authors demonstrated that GANs can be reinterpreted in order to learn representations across a diverse set of tasks without requiring domain knowledge or annotated data. Authors also showed that in an EBM GAN, discriminator can explicitly learn data distribution and capture the intrinsic manifold of data with low error rate. This is especially different from regular GANs where the discriminator is reduced to a constant function once the Nash Equilibrium is reached. To evaluate how well the discriminator estimates data distribution, the authors took advantage of Fisher Information theory. First, they showed that AFVs are a reliable indicator of whether GAN<br />
training is well behaved, and that we can use this monitoring to select good model checkpoints. Second, they illustrated that AFVs are a useful feature representation for linear and nearest neighbour classification, achieving state-of-the-art among unsupervised feature representations and competitive with supervised results on CIFAR-10. <br />
Finally, they showed that a well-trained GAN discriminator does contain useful information for fine-grained perceptual similarity suggesting that AFVs are good candidates for image search. All in all, the conducted experiments show the effectiveness of the EBM GANs coupled with the Fisher Information framework for extracting useful representational features from GANs. <br />
As future work, authors propose to improve the scalability of the AFV method by compressing the Fisher Vector representation, using methods like product quantization.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/apple/ml-afv link Adversarial Fisher Vectors].<br />
<br />
== Critique == <br />
<br />
This paper is has an excellent contribution in feature representation exploiting information theory and GANs. Although it lacked intuitive explanation of the formula it defined and how this representation is performing well in classification tasks. Therefore, an "Analysis" section would help the paper to be more readable and understandable.<br />
<br />
== References==<br />
<br />
Jaakkola, Tommi, and David Haussler. "Exploiting generative models in discriminative classifiers." Advances in neural information processing systems. 1999.<br />
<br />
Zhai, Shuangfei, et al. "Adversarial Fisher Vectors for Unsupervised Representation Learning." Advances in Neural Information Processing Systems. 2019.<br />
<br />
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." 2007 IEEE conference on computer vision and pattern recognition. IEEE, 2007.<br />
<br />
Sánchez, Jorge, et al. "Image classification with the fisher vector: Theory and practice." International journal of computer vision 105.3 (2013): 222-245.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_Learning&diff=48840Adversarial Fisher Vectors for Unsupervised Representation Learning2020-12-02T05:38:06Z<p>Mrasooli: /* Conclusion */</p>
<hr />
<div>== Presented by ==<br />
Sobhan Hemati<br />
<br />
== Introduction ==<br />
<br />
Generative adversarial networks (GANs) are among the most important generative models, where discriminators and generators compete with each other to solve a minimax game. Based on the original GAN paper, when the training is finished and Nash Equilibrium is reached, the discriminator is nothing but a constant function that assigns a score of 0.5 everywhere. This means that in this setting discriminator is nothing more than a tool to train the generator. Furthermore, the generator in traditional GAN models the data density in an implicit manner, while in some applications we need to have an explicit generative model of data. Recently, it has been shown that training an energy-based model (EBM) with a parameterized variational distribution is also a minimax game similar to the one in GAN. Although they are similar, an advantage of this EBM view is that unlike the original GAN formulation, the discriminator itself is an explicit density model of the data.<br />
<br />
Considering some remarks, the authors in this paper show that an energy-based model can be trained using a similar minimax formulation in GANs. After training the energy-based model, they use Fisher Score and Fisher Information (which are calculated based on derivative of the generative models w.r.t its parameters) to evaluate the power of discriminator in modeling the data distribution. More precisely, they calculate normalized Fisher Vectors and Fisher Distance measure using the derivative of the discriminator to estimate similarities both between individual data samples and between sets of samples. They name these derived representations Adversarial Fisher Vectors (AFVs). In fact, Fisher vector is a powerful representation that can be calculated using EBMs thanks to the fact that in this EBM model, the discriminator itself is an explicit density model of the data. Fisher vector can be used for setting representation problems which is a challenging task. In fact, as we will see, we can use the Fisher kernel to calculate the distance between two sets of images which is not a trivial task. The authors find several applications and attractive characteristics for AFV as pre-trained features such as:<br />
<br />
* State-of-the-art performance for unsupervised feature extraction and linear classification tasks.<br />
* Using the similarity function induced by the learned density model as a perceptual metric that correlates well with human judgments.<br />
* Improved training of GANs through monitoring (AFV metrics) and stability (MCMC updates) which is a difficult task in general.<br />
* Using AFV to estimate the distance between sets which allows monitoring the training process. More precisely, the Fisher Distance between the set of validation examples and generated examples can effectively capture the existence of overfitting.<br />
<br />
== Background == <br />
===Generative Adversarial Networks===<br />
GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated). The weights of generator and discriminator are updated by solving the following optimization problem:<br />
\begin{equation}<br />
\underset{G}{\text{max}} \ \underset{D}{\text{min}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]- E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[-\log (1-D(G(\mathbf{z})))]<br />
\tag{1}<br />
\label{1}<br />
\end{equation}<br />
<br />
Where <math> p_{data(\mathbf{x})} </math>, <math> D(x) </math>, and <math> G(x) </math> are distribution of data, discriminator, and generator respectively. To optimize the above problem, in the inner loop <math> D </math> is trained until convergence given <math> G </math>, and in the outer loop <math> G </math>, is updated one step given <math> D </math>.<br />
<br />
===GANs as variational training of deep EBMs===<br />
An energy-based model (EBM) is a form of generative model (GM) that learns the characteristics of a target dataset and generates a similar but larger dataset. EBMs detect the latent variables of a dataset and generate new datasets with a similar distribution. Let an energy-based model define a density function <math> p_{E}(\mathbf{x}) </math> as <math> \frac{e^{-E(\mathbf{x})}}{ \int_{\mathbf{x}} e^{-E(\mathbf{x})} \,d\mathbf{x} } </math>. Then, the negative log likelihood (NLL) of the <math> p_{E}(\mathbf{x}) </math> can be written as<br />
<br />
\begin{equation}<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log \int_{\mathbf{x}} q(\mathbf{x}) \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}\,d\mathbf{x} =<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log E_{\mathbf{x} \sim q(\mathbf{x})}[\frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}] \geq \\<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ E_{\mathbf{x} \sim q(\mathbf{x})}[\log \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}]= E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]- E_{\mathbf{x} \sim q(\mathbf{x})}[E(\mathbf{x})] + H(q)<br />
\tag{2}<br />
\label{2}<br />
\end{equation}<br />
<br />
where <math> q(x) </math> is an auxiliary distribution which is called the variational distribution and <math>H(q) </math> defines its entropy. Here Jensen’s inequality was used to obtain the variational lower bound on the NLL given <math>H(q) </math>. This bound is tight if <math> q(x) \propto e^{-E(\mathbf{x})} \ \forall \mathbf{x}, </math> which means <math> q(x) = p_{E}(\mathbf{x}) </math>. In this case, if we put <math> D(\mathbf{x})= -E(\mathbf{x}) </math> and also <math> q(x)= p_{G}(\mathbf{x}) </math>, Eq.\ref{2} turns to the following problem:<br />
<br />
<br />
<br />
\begin{equation}<br />
\underset{D}{\text{min}} \ \underset{G}{\text{max}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[\log (D(G(\mathbf{z})))] +H(p_{G})<br />
\tag{3}<br />
\label{3}<br />
\end{equation}<br />
<br />
<br />
where in the problem, the variational lower bound is maximized w.r.t. <math> p_{G}</math>; the energy model then is updated one step to decrease the NLL with the optimal <math> p_{G}</math> (see Figure1). [[File:Fig1.png]]<br />
<br />
Equations \ref{3} and \ref{1} are similar in the sense that both taking the form of a minimax game between <math> D </math> and <math> G </math>. However, there are 3 major differences:<br />
<br />
*The entropy regularization term <math> H(p_{G})</math> in Eq. \ref{3} prevents the generator from collapsing (although in practice, it is difficult to come up with a differentiable approximation to the entropy term <math> H(p_{G})</math> and instead heuristic regularization methods such as Batch Normalization are used).<br />
* The order of optimizing <math> D </math> and <math> G </math> is different.<br />
* More importantly, <math> D </math> is a density model for the data distribution and <math> G </math> learns to sample from <math> D </math>.<br />
<br />
== Methodology==<br />
===Adversarial Fisher Vectors===<br />
As it was mentioned, one of the most important advantages of an EBM GAN compared with traditional ones is that discriminator is a dual form of the generator. This means that the discriminator can define a distribution that matches the training data. Generally, there is a straightforward way for evaluating the quality of the generator and that is inspecting the quality of produced samples. However, when it comes to discriminator, this is not clear how to evaluate or use a discriminator trained in minimax scheme. To evaluate and also employ discriminator of the GAN, the authors in this paper propose to employ the theory of Fisher Information. This theory was proposed with the motivation of making connections between two different types of models used in machine learning i.e, generative and discriminator models. Given a density model <math> p_{\theta}(\mathbf{x})</math> where <math> \mathbf{x} \in R^d </math> and <math> \theta </math> are input and model parameters, the fisher score of an example <math> \mathbf{x} </math> is defined as <math> U_\mathbf{x}=\nabla_{\theta} \log p_{\theta}(\mathbf{x}) </math>. This gradient maps an example <math> \mathbf{x} </math> into a feature vector that is a point in the gradient space of the manifold. Intuitively, This gradient <math> U_\mathbf{x} </math> can be used to define the direction of steepest ascent in <math> \log p(\mathbf{x}|\theta) </math> for the example <math> \mathbf{x} </math> along the manifold. In other words, The Fisher<br />
Score encodes the desired change of model parameters to better fit the example <math> \mathbf{x} </math>. The authors define the Fisher Information as <math> I=E_{\mathbf{x} \sim} p_{\theta}(\mathbf{x}) [U_\mathbf{x} U_\mathbf{x}^T]</math>. Having Fisher Information and Fisher Score, one can then map an example <math> \mathbf{x} </math> from feature space to the model space, and measure the proximity between two examples <math> \mathbf{x} </math>; <math> \mathbf{y} </math> by <math> U_\mathbf{x}^T I^{-1} U_\mathbf{y}</math>. The metric distance based on this proximity is defined as <math> (U_\mathbf{x}-U_\mathbf{y})^T I^{-1} (U_\mathbf{x}-U_\mathbf{y})</math>. This metric distance is called Fisher distance and easily can be generalized to measure distance between two sets. Finally, The adversarial Fisher Distance (AFV) is defined as<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=I^{-\frac{1}{2}}U_\mathbf{x}<br />
\end{equation}<br />
<br />
As a result, Fisher Distance is equivalent to the Euclidean distance with AFVs. The fisher vector theory has been using simple generative models like gmms.<br />
In the domain of the EBMs, where the density model is parameterized as <math> p_\theta(\mathbf{x})= \frac{e^{-D(\mathbf{x},\theta)}}{\int_{\mathbf{x}} e^{-D(\mathbf{x},\theta)} \,d\mathbf{x}} </math> and <math> \theta </math> are parameters of <math> D</math>, the fisher score is derived as<br />
<br />
<br />
<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - \nabla_{\theta} \log \int_{\mathbf{x}} e^{D(\mathbf{x},\theta)} \,d\mathbf{x}= \nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{x} \sim p_\theta(\mathbf{x})} \nabla_{\theta} D(\mathbf{x};\theta).<br />
\tag{4}<br />
\label{4}<br />
\end{equation}<br />
As we know, in an EBM GAN, the generator is updated during the training to match the distribution of <math> p_G(\mathbf{x}) </math> to <math> p_\theta(\mathbf{x})</math>. This allows us to approximate the second term in Eq.\ref{4} by sampling form generator's distribution which let us to compute the Fisher Information and Fisher Score in EBM GAN as follow:<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{z} \sim p(\mathbf{z})} \nabla_{\theta} D(G(\mathbf{z});\theta), \quad I= E_{\mathbf{z} \sim p(\mathbf{z})}[U_{G(\mathbf{z})} U^T_{G(\mathbf{z})}]<br />
\tag{5}<br />
\label{5}<br />
\end{equation}<br />
<br />
Finally, having Fisher Score and Fisher Information, we use the following approximation to calculate AFV:<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=\mbox{diag}(I)^{-\frac{1}{2}}U_\mathbf{x}<br />
\tag{6}<br />
\label{6}<br />
\end{equation}<br />
<br />
Remember that by using Fisher Score, we transform data from feature space to the parameter space which means that the dimensionality of the vectors can easily be up to millions. As a result, replacing <math> I </math> with <math>\mbox{diag}(I) </math> is an attempt to reduce the computational load of calculating final AFV.<br />
<br />
===Generator update as stochastic gradient MCMC===<br />
The use of a generator provides an efficient way of drawing samples from the EBM. However, in practice, great care needs to be taken to make sure that G is well conditioned to produce examples that cover enough modes of D. There is also a related issue where the parameters of G will occasionally undergo sudden changes, generating samples drastically different from iteration to iteration, which contributes to training instability and lower model quality.<br />
<br />
In light of these issues, they provide a different treatment of G, borrowing inspirations from the Markov chain Monte Carlo (MCMC) literature. MCMC variants have been widely studied in the context of EBM's, which can be used to sample from an unnormalized density and approximate the partition function. Stochastic gradient MCMC is of particular interest as it utilizes the gradient of the log probability w.r.t. the input, and performs gradient ascent to incrementally update the samples(while adding noise to the gradients). See for a recent application of this technique to deepEBMs. We speculate that it is possible to train G to mimic the stochastic gradient MCMC update rule, such that the samples produced by G will approximate the true model distribution.<br />
<br />
== Experiments ==<br />
===Evaluating AFV representations===<br />
As it was pointed out, the main advantage of the EBM GANs is their powerful discriminator which is able to learn a density function that characterizes the data manifold of the training data. To evaluate how good the discriminator learns the data distribution, authors proposed to use Fisher Information theory. To do this, authors trained some models under different models and employed the discriminator to extract AFVs and then use these vectors for unsupervised pretraining classification task.<br />
Results in Table 1 suggest that AFVs achieve state-of-art performance in unsupervised pretraining classification task and also comparable with the supervised learning.<br />
<br />
[[File:Table1.png]]<br />
<br />
AFVs can also be used to measure distance between a set of data points. Authors took advantage of this point and calculate the semantic distance between classes (all data points of every class) in CIFAR 10. As can be seen from Figure 2, although the training has been unsupervised, the semantic relation between classed are well estimated. For example, in Figure 2 cars are similar to trucks, dogs are similar to cats.<br />
<br />
[[File:Sobhan_Fig2.jpg]]<br />
<br />
<br />
As AFVs transform data from feature space to the parameter space of the generative model and as a result carry information about the data manifold, they are also expected to carry additional fine-grained perceptual information. To evaluate this, authors ran experiments to examine the usefulness of AFVs as a perceptual similarity metric consistent with human judgments. They use the AFV representation to calculate distances between image patches and compare with current methods on the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset on 2AFC and Just Noticeable Difference (JND) metrics. They trained a GAN on ImageNet and then calculate AFVs on the BAPPS evaluation set.<br />
Table 2 shows the performance of AFV along with a variety of existing benchmarks. Clearly, AFV exceeds the reported unsupervised and self-supervised methods and is competitive with supervised methods trained on ImageNet.<br />
<br />
[[File:Sobhan_Table2.png]]<br />
<br />
An interesting point about AFVs is their robustness to overfitting. AFVs are 3 orders of magnitude higher than those of the existing methods, which would typically bring a higher propensity to overfitting. However, AFVs still show great generalization ability, demonstrating that they are indeed encoding a meaningful low dimensional subspace of original data. Figure 6 visualizes the nearest neighbours.<br />
<br />
[[File:Sobhan_Fig_6.png]]<br />
<br />
===Using the Fisher Distance to monitor training===<br />
Training GANs has been a challenging task which is partly because of the lack of reliable metrics. Although recently some domain specific metrics such as Inception Scores and Fréchet Inception Distance have been proposed, they are mainly relied on a discriminative model trained on ImageNet, and thus have limited<br />
applicability to datasets that are drastically different. In this paper, authors the Fisher Distance between the set of real and generated examples to monitor and diagnose the training process. To do this, conducted a set of experiments on CIFAR10 by varying the number of training examples from the set {1000; 5000; 25000; 50000}. Figure 3 shows batch-wise estimate of Inception Score and the "Fisher Similarity". This is clear that for higher number of training examples, the validation Fisher Similarity steadily increases, in the similar trend to the Inception Score. On the other hand, when the number of training examples is small, the validation Fisher Similarity starts decreasing at some point.<br />
<br />
[[File:Sobhan_Fig_3.png]]<br />
<br />
<br />
===Interpreting G update as parameterized MCMC===<br />
AFC can only be applied if a generator approximates EBM during the training process. Model is trained on Imagenet with 64X64 along with modification of default architecture with the addition of residual blocks to discriminator and generator. Following figure shows training stats over 80,000 iterations.<br />
<br />
[[File:training 80K.png|600px|center]]<br />
<div align="center">Left: default generator objective. Right: corresponding Inception scores.</div><br />
<br />
== Conclusion ==<br />
In this paper, the authors demonstrated that GANs can be reinterpreted in order to learn representations across a diverse set of tasks without requiring domain knowledge or annotated data. Authors also showed that in an EBM GAN, discriminator can explicitly learn data distribution and capture the intrinsic manifold of data with low error rate. This is especially different from regular GANs where the discriminator is reduced to a constant function once the Nash Equilibrium is reached. To evaluate how well the discriminator estimates data distribution, the authors took advantage of Fisher Information theory. First, they showed that AFVs are a reliable indicator of whether GAN<br />
training is well behaved, and that we can use this monitoring to select good model checkpoints. Second, they illustrated that AFVs are a useful feature representation for linear and nearest neighbour classification, achieving state-of-the-art among unsupervised feature representations and competitive with supervised results on CIFAR-10. <br />
Finally, they showed that a well-trained GAN discriminator does contain useful information for fine-grained perceptual similarity suggesting that AFVs are good candidates for image search. All in all, the conducted experiments show the effectiveness of the EBM GANs coupled with the Fisher Information framework for extracting useful representational features from GANs. <br />
As future work, authors propose to improve the scalability of the AFV method by compressing the Fisher Vector representation, using methods like product quantization.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/apple/ml-afv link Adversarial Fisher Vectors].<br />
<br />
== Critique == <br />
<br />
This paper is has an excellent contribution in feature representation exploiting information theory and GANs. Although it lacked intuitive explanation of the formula it defined and how this representation is performing well in classification tasks. Therefore, an "Analysis" section would help the paper to be more readable and understandable.<br />
<br />
== References==<br />
<br />
Jaakkola, Tommi, and David Haussler. "Exploiting generative models in discriminative classifiers." Advances in neural information processing systems. 1999.<br />
<br />
Zhai, Shuangfei, et al. "Adversarial Fisher Vectors for Unsupervised Representation Learning." Advances in Neural Information Processing Systems. 2019.<br />
<br />
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." 2007 IEEE conference on computer vision and pattern recognition. IEEE, 2007.<br />
<br />
Sánchez, Jorge, et al. "Image classification with the fisher vector: Theory and practice." International journal of computer vision 105.3 (2013): 222-245.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_Learning&diff=48836Adversarial Fisher Vectors for Unsupervised Representation Learning2020-12-02T05:35:21Z<p>Mrasooli: /* GANs as variational training of deep EBMs */</p>
<hr />
<div>== Presented by ==<br />
Sobhan Hemati<br />
<br />
== Introduction ==<br />
<br />
Generative adversarial networks (GANs) are among the most important generative models, where discriminators and generators compete with each other to solve a minimax game. Based on the original GAN paper, when the training is finished and Nash Equilibrium is reached, the discriminator is nothing but a constant function that assigns a score of 0.5 everywhere. This means that in this setting discriminator is nothing more than a tool to train the generator. Furthermore, the generator in traditional GAN models the data density in an implicit manner, while in some applications we need to have an explicit generative model of data. Recently, it has been shown that training an energy-based model (EBM) with a parameterized variational distribution is also a minimax game similar to the one in GAN. Although they are similar, an advantage of this EBM view is that unlike the original GAN formulation, the discriminator itself is an explicit density model of the data.<br />
<br />
Considering some remarks, the authors in this paper show that an energy-based model can be trained using a similar minimax formulation in GANs. After training the energy-based model, they use Fisher Score and Fisher Information (which are calculated based on derivative of the generative models w.r.t its parameters) to evaluate the power of discriminator in modeling the data distribution. More precisely, they calculate normalized Fisher Vectors and Fisher Distance measure using the derivative of the discriminator to estimate similarities both between individual data samples and between sets of samples. They name these derived representations Adversarial Fisher Vectors (AFVs). In fact, Fisher vector is a powerful representation that can be calculated using EBMs thanks to the fact that in this EBM model, the discriminator itself is an explicit density model of the data. Fisher vector can be used for setting representation problems which is a challenging task. In fact, as we will see, we can use the Fisher kernel to calculate the distance between two sets of images which is not a trivial task. The authors find several applications and attractive characteristics for AFV as pre-trained features such as:<br />
<br />
* State-of-the-art performance for unsupervised feature extraction and linear classification tasks.<br />
* Using the similarity function induced by the learned density model as a perceptual metric that correlates well with human judgments.<br />
* Improved training of GANs through monitoring (AFV metrics) and stability (MCMC updates) which is a difficult task in general.<br />
* Using AFV to estimate the distance between sets which allows monitoring the training process. More precisely, the Fisher Distance between the set of validation examples and generated examples can effectively capture the existence of overfitting.<br />
<br />
== Background == <br />
===Generative Adversarial Networks===<br />
GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated). The weights of generator and discriminator are updated by solving the following optimization problem:<br />
\begin{equation}<br />
\underset{G}{\text{max}} \ \underset{D}{\text{min}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]- E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[-\log (1-D(G(\mathbf{z})))]<br />
\tag{1}<br />
\label{1}<br />
\end{equation}<br />
<br />
Where <math> p_{data(\mathbf{x})} </math>, <math> D(x) </math>, and <math> G(x) </math> are distribution of data, discriminator, and generator respectively. To optimize the above problem, in the inner loop <math> D </math> is trained until convergence given <math> G </math>, and in the outer loop <math> G </math>, is updated one step given <math> D </math>.<br />
<br />
===GANs as variational training of deep EBMs===<br />
An energy-based model (EBM) is a form of generative model (GM) that learns the characteristics of a target dataset and generates a similar but larger dataset. EBMs detect the latent variables of a dataset and generate new datasets with a similar distribution. Let an energy-based model define a density function <math> p_{E}(\mathbf{x}) </math> as <math> \frac{e^{-E(\mathbf{x})}}{ \int_{\mathbf{x}} e^{-E(\mathbf{x})} \,d\mathbf{x} } </math>. Then, the negative log likelihood (NLL) of the <math> p_{E}(\mathbf{x}) </math> can be written as<br />
<br />
\begin{equation}<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log \int_{\mathbf{x}} q(\mathbf{x}) \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}\,d\mathbf{x} =<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log E_{\mathbf{x} \sim q(\mathbf{x})}[\frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}] \geq \\<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ E_{\mathbf{x} \sim q(\mathbf{x})}[\log \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}]= E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]- E_{\mathbf{x} \sim q(\mathbf{x})}[E(\mathbf{x})] + H(q)<br />
\tag{2}<br />
\label{2}<br />
\end{equation}<br />
<br />
where <math> q(x) </math> is an auxiliary distribution which is called the variational distribution and <math>H(q) </math> defines its entropy. Here Jensen’s inequality was used to obtain the variational lower bound on the NLL given <math>H(q) </math>. This bound is tight if <math> q(x) \propto e^{-E(\mathbf{x})} \ \forall \mathbf{x}, </math> which means <math> q(x) = p_{E}(\mathbf{x}) </math>. In this case, if we put <math> D(\mathbf{x})= -E(\mathbf{x}) </math> and also <math> q(x)= p_{G}(\mathbf{x}) </math>, Eq.\ref{2} turns to the following problem:<br />
<br />
<br />
<br />
\begin{equation}<br />
\underset{D}{\text{min}} \ \underset{G}{\text{max}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[\log (D(G(\mathbf{z})))] +H(p_{G})<br />
\tag{3}<br />
\label{3}<br />
\end{equation}<br />
<br />
<br />
where in the problem, the variational lower bound is maximized w.r.t. <math> p_{G}</math>; the energy model then is updated one step to decrease the NLL with the optimal <math> p_{G}</math> (see Figure1). [[File:Fig1.png]]<br />
<br />
Equations \ref{3} and \ref{1} are similar in the sense that both taking the form of a minimax game between <math> D </math> and <math> G </math>. However, there are 3 major differences:<br />
<br />
*The entropy regularization term <math> H(p_{G})</math> in Eq. \ref{3} prevents the generator from collapsing (although in practice, it is difficult to come up with a differentiable approximation to the entropy term <math> H(p_{G})</math> and instead heuristic regularization methods such as Batch Normalization are used).<br />
* The order of optimizing <math> D </math> and <math> G </math> is different.<br />
* More importantly, <math> D </math> is a density model for the data distribution and <math> G </math> learns to sample from <math> D </math>.<br />
<br />
== Methodology==<br />
===Adversarial Fisher Vectors===<br />
As it was mentioned, one of the most important advantages of an EBM GAN compared with traditional ones is that discriminator is a dual form of the generator. This means that the discriminator can define a distribution that matches the training data. Generally, there is a straightforward way for evaluating the quality of the generator and that is inspecting the quality of produced samples. However, when it comes to discriminator, this is not clear how to evaluate or use a discriminator trained in minimax scheme. To evaluate and also employ discriminator of the GAN, the authors in this paper propose to employ the theory of Fisher Information. This theory was proposed with the motivation of making connections between two different types of models used in machine learning i.e, generative and discriminator models. Given a density model <math> p_{\theta}(\mathbf{x})</math> where <math> \mathbf{x} \in R^d </math> and <math> \theta </math> are input and model parameters, the fisher score of an example <math> \mathbf{x} </math> is defined as <math> U_\mathbf{x}=\nabla_{\theta} \log p_{\theta}(\mathbf{x}) </math>. This gradient maps an example <math> \mathbf{x} </math> into a feature vector that is a point in the gradient space of the manifold. Intuitively, This gradient <math> U_\mathbf{x} </math> can be used to define the direction of steepest ascent in <math> \log p(\mathbf{x}|\theta) </math> for the example <math> \mathbf{x} </math> along the manifold. In other words, The Fisher<br />
Score encodes the desired change of model parameters to better fit the example <math> \mathbf{x} </math>. The authors define the Fisher Information as <math> I=E_{\mathbf{x} \sim} p_{\theta}(\mathbf{x}) [U_\mathbf{x} U_\mathbf{x}^T]</math>. Having Fisher Information and Fisher Score, one can then map an example <math> \mathbf{x} </math> from feature space to the model space, and measure the proximity between two examples <math> \mathbf{x} </math>; <math> \mathbf{y} </math> by <math> U_\mathbf{x}^T I^{-1} U_\mathbf{y}</math>. The metric distance based on this proximity is defined as <math> (U_\mathbf{x}-U_\mathbf{y})^T I^{-1} (U_\mathbf{x}-U_\mathbf{y})</math>. This metric distance is called Fisher distance and easily can be generalized to measure distance between two sets. Finally, The adversarial Fisher Distance (AFV) is defined as<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=I^{-\frac{1}{2}}U_\mathbf{x}<br />
\end{equation}<br />
<br />
As a result, Fisher Distance is equivalent to the Euclidean distance with AFVs. The fisher vector theory has been using simple generative models like gmms.<br />
In the domain of the EBMs, where the density model is parameterized as <math> p_\theta(\mathbf{x})= \frac{e^{-D(\mathbf{x},\theta)}}{\int_{\mathbf{x}} e^{-D(\mathbf{x},\theta)} \,d\mathbf{x}} </math> and <math> \theta </math> are parameters of <math> D</math>, the fisher score is derived as<br />
<br />
<br />
<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - \nabla_{\theta} \log \int_{\mathbf{x}} e^{D(\mathbf{x},\theta)} \,d\mathbf{x}= \nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{x} \sim p_\theta(\mathbf{x})} \nabla_{\theta} D(\mathbf{x};\theta).<br />
\tag{4}<br />
\label{4}<br />
\end{equation}<br />
As we know, in an EBM GAN the generator is updated during training so that its distribution <math> p_G(\mathbf{x}) </math> matches <math> p_\theta(\mathbf{x})</math>. This allows us to approximate the second term in Eq. \ref{4} by sampling from the generator's distribution, which lets us compute the Fisher Score and Fisher Information in an EBM GAN as follows:<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{z} \sim p(\mathbf{z})} \nabla_{\theta} D(G(\mathbf{z});\theta), \quad I= E_{\mathbf{z} \sim p(\mathbf{z})}[U_{G(\mathbf{z})} U^T_{G(\mathbf{z})}]<br />
\tag{5}<br />
\label{5}<br />
\end{equation}<br />
<br />
Finally, having the Fisher Score and Fisher Information, we use the following approximation to calculate the AFV:<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=\mbox{diag}(I)^{-\frac{1}{2}}U_\mathbf{x}<br />
\tag{6}<br />
\label{6}<br />
\end{equation}<br />
<br />
Remember that by using the Fisher Score we transform data from feature space to parameter space, which means that the dimensionality of the vectors can easily reach millions. As a result, replacing <math> I </math> with <math>\mbox{diag}(I) </math> is an attempt to reduce the computational load of calculating the final AFV.<br />
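The following PyTorch-style sketch illustrates how Eqs. (5) and (6) could be computed for a single example. The networks <code>D</code> and <code>G</code>, the noise dimension <code>z_dim</code>, and the number of Monte Carlo samples are assumptions made here for illustration, not the authors' implementation.<br />
<pre>
import torch

def adversarial_fisher_vector(x, D, G, z_dim, n_samples=64):
    # Sketch of Eqs. (5)-(6): AFV of a single example x.
    # Assumes D(x) returns a scalar score per example and G(z) maps noise to images.
    params = [p for p in D.parameters() if p.requires_grad]

    def flat_grad(scalar):
        grads = torch.autograd.grad(scalar, params)
        return torch.cat([g.reshape(-1) for g in grads]).detach()

    # Gradients of D at generated samples (second term of Eq. 5)
    gen_grads = []
    for _ in range(n_samples):
        z = torch.randn(1, z_dim)
        gen_grads.append(flat_grad(D(G(z)).sum()))
    gen_grads = torch.stack(gen_grads)        # (n_samples, n_params)
    mean_grad = gen_grads.mean(dim=0)         # E_z[grad_theta D(G(z))]

    # Fisher Score of x and of each generated sample (Eq. 5)
    U_x = flat_grad(D(x).sum()) - mean_grad
    U_gen = gen_grads - mean_grad

    # Diagonal Fisher Information and normalized AFV (Eq. 6)
    diag_I = (U_gen ** 2).mean(dim=0)
    return U_x / torch.sqrt(diag_I + 1e-8)
</pre>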
<br />
===Generator update as stochastic gradient MCMC===<br />
The use of a generator provides an efficient way of drawing samples from the EBM. However, in practice, great care needs to be taken to make sure that G is well conditioned to produce examples that cover enough modes of D. There is also a related issue where the parameters of G will occasionally undergo sudden changes, generating samples drastically different from iteration to iteration, which contributes to training instability and lower model quality.<br />
<br />
In light of these issues, the authors provide a different treatment of G, borrowing inspiration from the Markov chain Monte Carlo (MCMC) literature. MCMC variants have been widely studied in the context of EBMs, where they are used to sample from an unnormalized density and to approximate the partition function. Stochastic gradient MCMC is of particular interest as it utilizes the gradient of the log probability w.r.t. the input and performs gradient ascent to incrementally update the samples (while adding noise to the gradients); this technique has recently been applied to deep EBMs. The authors speculate that it is possible to train G to mimic the stochastic gradient MCMC update rule, such that the samples produced by G will approximate the true model distribution.<br />
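For intuition, a stochastic gradient MCMC (Langevin-style) refinement of a batch of samples looks like the sketch below. The step size, number of steps, and the use of the discriminator <code>D</code> as the unnormalized log-density are illustrative assumptions rather than the paper's exact procedure.<br />
<pre>
import torch

def langevin_refine(x, D, n_steps=10, step_size=0.01):
    # Refine samples x so that they move toward high-density regions of the
    # distribution implied by D (treated here as an unnormalized log-density).
    x = x.clone().detach()
    for _ in range(n_steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(D(x).sum(), x)[0]
        with torch.no_grad():
            x = x + 0.5 * step_size * grad + (step_size ** 0.5) * torch.randn_like(x)
    return x.detach()
</pre>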
<br />
== Experiments ==<br />
===Evaluating AFV representations===<br />
As pointed out earlier, the main advantage of EBM GANs is their powerful discriminator, which is able to learn a density function that characterizes the data manifold of the training data. To evaluate how well the discriminator learns the data distribution, the authors proposed to use Fisher Information theory. To do this, they trained models under different settings, employed the discriminator to extract AFVs, and then used these vectors for an unsupervised pre-training classification task.<br />
Results in Table 1 suggest that AFVs achieve state-of-the-art performance on the unsupervised pre-training classification task and are also comparable with supervised learning.<br />
<br />
[[File:Table1.png]]<br />
<br />
AFVs can also be used to measure the distance between sets of data points. The authors took advantage of this and calculated the semantic distance between classes (all data points of every class) in CIFAR-10. As can be seen from Figure 2, although the training is unsupervised, the semantic relations between classes are well estimated. For example, in Figure 2 cars are similar to trucks, and dogs are similar to cats.<br />
<br />
[[File:Sobhan_Fig2.jpg]]<br />
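A minimal sketch of one way such a set-to-set distance could be computed is shown below; representing each set by the mean of its AFVs is an assumption made here for illustration, not necessarily the authors' exact procedure.<br />
<pre>
import numpy as np

def fisher_set_distance(afvs_a, afvs_b):
    # afvs_a, afvs_b: arrays of shape (n_images, afv_dim) holding the normalized
    # Fisher Vectors of two sets of images (e.g. two CIFAR-10 classes).
    # Since the Fisher Distance reduces to a Euclidean distance on AFVs,
    # a simple set distance is the Euclidean distance between the mean AFVs.
    return np.linalg.norm(afvs_a.mean(axis=0) - afvs_b.mean(axis=0))
</pre>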
<br />
<br />
As AFVs transform data from feature space to the parameter space of the generative model and as a result carry information about the data manifold, they are also expected to carry additional fine-grained perceptual information. To evaluate this, the authors ran experiments to examine the usefulness of AFVs as a perceptual similarity metric consistent with human judgments. They use the AFV representation to calculate distances between image patches and compare with current methods on the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset using the 2AFC and Just Noticeable Difference (JND) metrics. They trained a GAN on ImageNet and then calculated AFVs on the BAPPS evaluation set.<br />
Table 2 shows the performance of AFV along with a variety of existing benchmarks. Clearly, AFV exceeds the reported unsupervised and self-supervised methods and is competitive with supervised methods trained on ImageNet.<br />
<br />
[[File:Sobhan_Table2.png]]<br />
<br />
An interesting point about AFVs is their robustness to overfitting. The dimensionality of AFVs is 3 orders of magnitude higher than that of the features used by existing methods, which would typically bring a higher propensity to overfit. However, AFVs still show great generalization ability, demonstrating that they are indeed encoding a meaningful low-dimensional subspace of the original data. Figure 6 visualizes the nearest neighbours.<br />
<br />
[[File:Sobhan_Fig_6.png]]<br />
<br />
===Using the Fisher Distance to monitor training===<br />
Training GANs has been a challenging task, partly because of the lack of reliable metrics. Although some domain-specific metrics such as the Inception Score and the Fréchet Inception Distance have recently been proposed, they mainly rely on a discriminative model trained on ImageNet and thus have limited applicability to datasets that are drastically different. In this paper, the authors use the Fisher Distance between the set of real and generated examples to monitor and diagnose the training process. To do this, they conducted a set of experiments on CIFAR-10, varying the number of training examples over {1000, 5000, 25000, 50000}. Figure 3 shows the batch-wise estimates of the Inception Score and the "Fisher Similarity". It is clear that for larger numbers of training examples, the validation Fisher Similarity steadily increases, following a trend similar to the Inception Score. On the other hand, when the number of training examples is small, the validation Fisher Similarity starts decreasing at some point.<br />
<br />
[[File:Sobhan_Fig_3.png]]<br />
<br />
<br />
===Interpreting G update as parameterized MCMC===<br />
AFV can only be applied if a generator approximates the EBM during the training process. The model is trained on ImageNet at 64x64 resolution, with the default architecture modified by adding residual blocks to the discriminator and the generator. The following figure shows training statistics over 80,000 iterations.<br />
<br />
[[File:training 80K.png|600px|center]]<br />
<div align="center">Left: default generator objective. Right: corresponding Inception scores.</div><br />
<br />
== Conclusion ==<br />
In this paper, the authors demonstrated that GANs can be reinterpreted in order to learn representations for a diverse set of tasks without requiring domain knowledge or labeled data. The authors also showed that in an EBM GAN the discriminator can explicitly learn the data distribution and capture the intrinsic manifold of the data with low error. This is especially different from regular GANs, where the discriminator is reduced to a constant function once the Nash Equilibrium is reached. To evaluate how well the discriminator estimates the data distribution, the authors took advantage of Fisher Information theory. First, they showed that AFVs are a reliable indicator of whether GAN training is well behaved, and that this monitoring can be used to select good model checkpoints. Second, they illustrated that AFVs are a useful feature representation for linear and nearest-neighbour classification, achieving state-of-the-art results among unsupervised feature representations and results competitive with supervised ones on CIFAR-10. <br />
Finally, they showed that a well-trained GAN discriminator does contain useful information for fine-grained perceptual similarity, suggesting that AFVs are good candidates for image search. All in all, the conducted experiments show the effectiveness of EBM GANs coupled with the Fisher Information framework for extracting useful representational features from GANs. As future work, the authors propose to improve the scalability of the AFV method by compressing the Fisher Vector representation, for example using methods like product quantization.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/apple/ml-afv Adversarial Fisher Vectors].<br />
<br />
== Critique == <br />
<br />
This paper makes an excellent contribution to feature representation by exploiting information theory and GANs. However, it lacks an intuitive explanation of the formulas it defines and of why this representation performs well in classification tasks. Therefore, an "Analysis" section would help make the paper more readable and understandable.<br />
<br />
== References==<br />
<br />
Jaakkola, Tommi, and David Haussler. "Exploiting generative models in discriminative classifiers." Advances in neural information processing systems. 1999.<br />
<br />
Zhai, Shuangfei, et al. "Adversarial Fisher Vectors for Unsupervised Representation Learning." Advances in Neural Information Processing Systems. 2019.<br />
<br />
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." 2007 IEEE conference on computer vision and pattern recognition. IEEE, 2007.<br />
<br />
Sánchez, Jorge, et al. "Image classification with the fisher vector: Theory and practice." International journal of computer vision 105.3 (2013): 222-245.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations&diff=48834ALBERT: A Lite BERT for Self-supervised Learning of Language Representations2020-12-02T05:29:34Z<p>Mrasooli: /* Factorized embedding parameterization */</p>
<hr />
<div>== Presented by == <br />
Maziar Dadbin<br />
<br />
==Introduction==<br />
In this paper, the authors have made some changes to the BERT model and the result is ALBERT, a model that outperforms BERT on the GLUE, SQuAD, and RACE benchmarks. The important point is that ALBERT has fewer parameters than BERT-large, but it still produces better results. The changes made to the BERT model are factorized embedding parameterization and cross-layer parameter sharing, which are two methods of parameter reduction. They also introduced a new loss function to replace one of the loss functions used in BERT (i.e., NSP). The last change is removing dropout from the model.<br />
<br />
== Motivation == <br />
In natural language representations, larger models often result in improved performance. However, at some point GPU/TPU memory and training-time constraints limit our ability to increase the model size any further. There exist some attempts to reduce memory consumption, but at the cost of speed. For example, Chen et al. (2016)[1] use a gradient checkpointing technique that reduces memory requirements at the cost of an extra forward pass. Moreover, Gomez et al. (2017)[2] propose a method that reconstructs a layer's activations from the next layer, eliminating the need to store these activations and freeing up memory. In addition, Raffel et al. (2019)[3] leverage model parallelization while training a massive model. The authors of this paper claim that their parameter reduction techniques reduce memory consumption and increase training speed.<br />
<br />
==Model details==<br />
The fundamental structure of ALBERT is the same as BERT's, i.e., it uses a Transformer encoder with GELU nonlinearities. The authors set the feed-forward/filter size to 4H and the number of attention heads to H/64 (where H is the size of the hidden layer). Next, we explain the changes that have been applied to BERT.<br />
<br />
<br />
===Factorized embedding parameterization===<br />
In BERT (as well as subsequent models like XLNet and RoBERTa) we have <math display="inline">E=H</math>, i.e. the size of the vocabulary embedding (<math display="inline">E</math>) and the size of the hidden layer (<math display="inline">H</math>) are tied together. This choice is not efficient because we may need a large hidden layer but not a large vocabulary embedding layer. This is the case in many applications because the vocabulary embedding <math display="inline">E</math> is meant to learn context-independent representations while the hidden-layer embedding <math display="inline">H</math> is meant to learn context-dependent representations, which is usually harder. However, if we increase <math display="inline">H</math> and <math display="inline">E</math> together, it will result in a huge increase in the number of parameters because the size of the vocabulary embedding matrix is <math display="inline">V \cdot E</math>, where <math display="inline">V</math> is the size of the vocabulary and is usually quite large. For example, <math display="inline">V</math> equals 30000 in both BERT and ALBERT. <br />
The authors proposed the following solution to the problem:<br />
Do not project one-hot vectors directly into the hidden space; instead, first project them into a lower-dimensional space of size <math display="inline">E</math> and then project that into the hidden layer. This reduces the number of embedding parameters from <math display="inline">O(V \cdot H)</math> to <math display="inline">O(V \cdot E+E \cdot H)</math>, which is significant when <math display="inline">H</math> is much larger than <math display="inline">E</math>.<br />
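A minimal PyTorch-style sketch of this factorization is given below; the module and the dimensions used are illustrative assumptions, not the official ALBERT implementation.<br />
<pre>
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    # Illustrative sketch: one-hot token -> E-dimensional embedding -> H-dimensional hidden space.
    # With V=30000, E=128, H=4096 the embedding parameters drop from
    # V*H = 122.9M to V*E + E*H = 4.4M.
    def __init__(self, vocab_size=30000, embed_size=128, hidden_size=4096):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embed_size)  # V x E
        self.projection = nn.Linear(embed_size, hidden_size)         # E x H

    def forward(self, token_ids):
        return self.projection(self.word_embeddings(token_ids))      # (..., H)
</pre>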
<br />
===Cross-layer parameter sharing===<br />
Another method the authors used for reducing the number of parameters is to share the parameters across layers. There are different strategies for parameter sharing. For example, one may share only the feed-forward network parameters or only the attention parameters. However, the default choice for ALBERT is to simply share all parameters across layers.<br />
The following table shows the effect of different parameter-sharing strategies in two settings for the vocabulary embedding size. In both cases, sharing all the parameters has a negative effect on the accuracy, and most of this effect comes from sharing the FFN parameters rather than the attention parameters. Despite this, the authors decided to share all parameters across layers, resulting in a much smaller number of parameters; this in turn enables them to use larger hidden layers, which is how they compensate for what is lost through parameter sharing. <br />
<br />
[[File:sharing.png | center |800px]]<br />
<br />
<br />
'''Why does cross-layer parameter sharing work?'''<br />
From the experiment results, we can see that cross-layer parameter sharing dramatically reduces the model size without hurting the accuracy too much. While it is obvious that sharing parameters can reduce the model size, it might be worth thinking about why parameters can be shared across BERT layers. Two of the authors briefly explained the reason in a blog. They noticed that the network often learned to perform similar operations at various layers (Soricut, Lan, 2019). Previous research also showed that attention heads in BERT behave similarly (Clark et al., 2019). These observations made it possible to use the same weights at different layers.<br />
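The sketch below illustrates what the all-shared strategy means in code; the layer sizes are illustrative and not ALBERT's exact configuration.<br />
<pre>
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    # Cross-layer parameter sharing: one set of layer weights, reused at every depth,
    # so the parameter count does not grow with the number of layers.
    def __init__(self, hidden_size=768, num_heads=12, num_layers=12):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=4 * hidden_size, activation="gelu")
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.shared_layer(x)
        return x
</pre>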
<br />
===Inter-sentence coherence loss===<br />
<br />
BERT uses two loss functions, namely the masked language modelling (MLM) loss and the next-sentence prediction (NSP) loss. NSP is a binary classification loss where positive examples are two consecutive segments from the training corpus and negative examples pair segments from different documents. The negative and positive examples are sampled with equal probability. However, experiments show that NSP is not effective, and it should also be pointed out that the NSP loss overlaps with the MLM loss in terms of topic prediction. In fact, the necessity of the NSP loss has been questioned in the literature (Lample and Conneau, 2019; Joshi et al., 2019). The authors explain the reason as follows:<br />
A negative example in NSP is misaligned from both the topic and the coherence perspective. However, topic prediction is easier to learn than coherence prediction. Hence, the model ends up learning just the easier topic-prediction signal. For example, the model can easily be trained to learn that "I love cats" and "I had sushi for lunch" are not coherent, as they are already very different topic-wise, but it might not be able to tell that "I love cats" and "my mom owned a dog" are not next to each other.<br />
They address this problem by introducing a new loss, namely sentence order prediction (SOP), which is again a binary classification loss. Positive examples are the same as in NSP (two consecutive segments), but the negative examples are the same two consecutive segments with their order swapped. SOP forces the model to learn the harder coherence-prediction task. The following table compares NSP with SOP. As we can see, a model trained with NSP cannot solve the SOP task (it performs at chance level, 52%), but a model trained with SOP can solve the NSP task to an acceptable degree (78.9%). We also see that on average SOP improves results on downstream tasks by almost 1%. Therefore, the authors decided to use MLM and SOP as the loss functions.<br />
<br />
<br />
<br />
[[File:SOPvsNSP.png | center |800px]]<br />
<br />
<br />
'''What does sentence order prediction (SOP) look like?'''<br />
<br />
'''Through a mathematical lens:'''<br />
<br />
First, we present some variables. <math display="inline">\vec{s_{j}}</math> is the <math display="inline">j^{th}</math> textual segment in a document <math display="inline"> D </math>, where <math display="inline"> \vec{s_{j}} \in span \{ \vec{w^{j}_1}, ... , \vec{w^{j}_n} \} </math> and <math display="inline"> \vec{w^{j}_i} </math> is the <math display="inline">i^{th}</math> word in <math display="inline">\vec{s_{j}}</math>. The task of SOP is, given <math display="inline">\vec{s_{k}}</math>, to predict whether a following textual segment <math display="inline">\vec{s_{k+1}}</math> is truly the following segment or not. Here the subscripts <math display="inline">k</math> and <math display="inline">k+1</math> denote the ordering, and the task is to predict whether <math display="inline">\vec{s_{k+1}}</math> is actually <math display="inline">\vec{s_{j+1}}</math> or <math display="inline">\vec{s_{j}}</math>.<br />
<br />
<br />
'''Through a visual lens:'''<br />
<br />
[[File:SOP.PNG | center | 800px]]<br />
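'''Through a code lens:'''<br />
<br />
The following is a small illustrative sketch of how SOP training pairs could be constructed from consecutive segments; it is an assumption about the data pipeline, not the authors' code.<br />
<pre>
import random

def make_sop_example(segment_a, segment_b):
    # segment_a and segment_b are two consecutive text segments from the same document.
    # Label 1: segments kept in their original order; label 0: order swapped.
    if random.random() < 0.5:
        return (segment_a, segment_b), 1
    return (segment_b, segment_a), 0
</pre>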
<br />
===Removing dropout===<br />
The last change the authors applied to BERT is removing dropout. Table 8 below shows the effect of removing dropout. They also observe that the model does not overfit the data even after 1M training steps. The authors point out that empirical [8] and theoretical [9] evidence suggests that batch normalization in combination with dropout may have harmful effects, particularly in convolutional neural networks. They speculate that dropout may be having a similar effect here.<br />
[[File:dropout.png | center |800px]]<br />
<br />
===Effect of Network Depth and Width===<br />
<br />
In Table 11, we can see the effect of increasing the number of layers. In all these settings the size of the hidden layers is 1024. It appears that increasing the depth of the model gives better and better results until the number of layers reaches 24. However, increasing the depth from 24 to 48 degrades the performance of the model.<br />
<br />
[[File:ALBERT_table11.png | center |800px]]<br />
<br />
Table 12 shows the effect of the width of the model. The accuracy of the model improves until the width of the network reaches 4096; after that, any further increase in width appears to decrease the accuracy of the model.<br />
[[File:ALBERT_table12.png | center |800px]]<br />
<br />
Table 13 investigates if we need a very deep model when the model is very wide. It seems that when we have H=4096, the difference between the performance of models with 12 or 24 layers is negligible. <br />
[[File:ALBERT_table13.png | center |800px]]<br />
<br />
These three tables illustrate the logic behind the authors' decisions about the width and depth of the model.<br />
== Source Code ==<br />
<br />
The official source code is available at: https://github.com/google-research/ALBERT<br />
==Conclusion==<br />
Looking at the following table, we can see that ALBERT-xxlarge outperforms BERT-large on all the downstream tasks. Note that ALBERT-xxlarge uses a larger configuration (yet a smaller number of parameters) than BERT-large and, as a result, is about 3 times slower.<br />
<br />
[[File:result.png | center |800px]]<br />
<br />
==Critiques==<br />
The authors mentioned that we usually get better results if we train our model for a longer time. Therefore, they present a comparison in which they trained both ALBERT-xxlarge and BERT-large for the same amount of time instead of the same number of steps. Here are the results:<br />
[[File:sameTime.png | center |800px]]<br />
<br />
However, in my opinion, it is not a fair comparison to let ALBERT-xxlarge train for 125K steps while BERT-large completes 400K steps in the same amount of time, because beyond a certain number of training steps, additional steps do not improve the result by much. It would be better to look at the results when BERT-large is trained for 125K steps and ALBERT-xxlarge is trained for the same amount of time; I suspect that in that case the result would be in favour of BERT-large. It would also be nice to have a plot with time on the horizontal axis and accuracy on the vertical axis. Then we would probably see that BERT-large is better at first, but at some point ALBERT-xxlarge starts to give higher accuracy.<br />
<br />
This paper proposed an embedding factorization to reduce the number of parameters in the embedding dimension, but the authors didn't cite or compare to related approaches. However, this kind of dimensionality reduction has been explored with other techniques, for example for knowledge distillation, quantization, or even adaptive input/softmax.<br />
<br />
==Reference==<br />
[1]: Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.<br />
<br />
[2]: Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in neural information processing systems, pp. 2214–2224, 2017.<br />
<br />
[3]: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.<br />
<br />
[4]: Radu Soricut, Zhenzhong. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. 2019. URL https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html<br />
<br />
[5]: Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning. What Does BERT Look At? An Analysis of BERT's Attention. 2019. URL https://arxiv.org/abs/1906.04341<br />
<br />
[6]: Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. 2019. URL https://arxiv.org/abs/1907.10529<br />
<br />
[7]: Guillaume Lample and Alexis Conneau. Crosslingual language model pretraining. 2019. URL https://arxiv.org/abs/1901.07291<br />
<br />
[8]: Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.<br />
<br />
[9]: Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the disharmony between dropout and batch normalization by variance shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2682–2690, 2019</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_Representations&diff=48831Self-Supervised Learning of Pretext-Invariant Representations2020-12-02T05:21:55Z<p>Mrasooli: /* Image Classification with linear models */</p>
<hr />
<div>==Authors==<br />
<br />
Ishan Misra, Laurens van der Maaten<br />
<br />
== Presented by == <br />
Sina Farsangi<br />
<br />
== Introduction == <br />
<br />
Modern image recognition and object detection systems find image representations using large amounts of data with pre-defined semantic annotations. Some examples of these annotations are class labels [1] and bounding boxes [2], as shown in Figure 1. Finding representations using pre-defined semantic annotations therefore requires a large amount of labeled data, which is not available in all scenarios. Also, these systems usually learn features specific to a particular set of classes and not necessarily semantically meaningful features that can help generalize to other domains and classes. '''In other words, pre-defined semantic annotations scale poorly to the long tail of visual concepts'''[3]. Therefore, there has been great interest in the community in finding image representations that are more visually meaningful and can help in several tasks such as image recognition and object detection. One of the fast-growing areas of research that tries to address this problem is '''Self-Supervised Learning'''. Self-Supervised Learning tries to learn deep models that find image representations from the pixels themselves rather than from pre-defined semantic annotations. As we will show, in self-supervised learning there is no need for human-provided class labels or bounding boxes for classification and object detection tasks, respectively. <br />
<br />
[[File: SSL_1.JPG | 800px | center]]<br />
<div align="center">'''Figure 1:''' Semantic Annotations used for finding image representations: a) Class labels and b) Bounding Boxes </div><br />
<br />
Self-Supervised Learning is often done using a set of tasks called '''pretext tasks'''. During these tasks, a transformation <math> \tau </math> is applied to unlabeled images <math> I </math> to obtain a set of transformed images, <math> I^{t} </math>. Then, a deep neural network, <math> \phi(\theta) </math>, is trained to predict a characteristic of the transformation. Several pretext tasks exist based on the type of transformation used. Two of the most widely used pretext tasks are rotation and jigsaw puzzles [4,5,6]. As shown in Figure 2, in the rotation task, unlabeled images are rotated by a random angle (0, 90, 180, or 270 degrees) and the deep network learns to predict the rotation. In the jigsaw task, which is more complicated than the rotation prediction task, unlabeled images are cropped into 9 patches and the image is perturbed by randomly permuting the nine patches. Each permutation falls into one of 35 classes according to a formula. A deep network is then trained to predict the class of the permutation of the patches in the perturbed image. Some other tasks include colorization, where the model tries to recover the colors of a colored image turned to greyscale, and image reconstruction, where a square chunk of the image is deleted and the model tries to reconstruct it. <br />
<br />
[[File: SSL_2.JPG |1000px | center]]<br />
<div align="center">'''Figure 2:''' Self-Supervised Learning using Rotation and Jigsaw Pretext Tasks </div><br />
<br />
Although the proposed pretext tasks have achieved promising results, they have the disadvantage of being covariant to the applied transformation. In other words, as deep networks are trained to predict transformation characteristics, they also learn representations that vary with the applied transformation. Intuitively, we would like to obtain representations that are common between the original images and the transformed ones. This idea is supported by the fact that humans are able to recognize these transformed images. This suggests developing a method that obtains image representations that are common between the original and transformed images, in other words, image representations that are transformation invariant. The paper addresses this problem by introducing '''Pretext-Invariant Representation Learning''' (PIRL), which learns self-supervised image representations that, as opposed to those from pretext tasks, are transformation invariant and therefore more semantically meaningful. The performance of the proposed method is evaluated on several self-supervised learning benchmarks. The results show that PIRL sets a new state of the art in self-supervised learning by learning transformation-invariant representations.<br />
<br />
== Problem Formulation and Methodology ==<br />
<br />
[[File: SSL_3.JPG | 800px | center]]<br />
<div align="center">'''Figure 3:''' Overview of Standard Pretext Learning and Pretext-Invariant Representation Learning (PIRL). </div><br />
<br />
<br />
An overview of the proposed method and a comparison with pretext tasks are shown in Figure 3. For a given image <math>I</math> in the dataset of unlabeled images <math> D=\{{I_1,I_2,...,I_{|D|}}\} </math>, a transformation <math> \tau </math> is applied: <br />
<br />
\begin{align} \tag{1} \label{eqn:1}<br />
I^t=\tau(I)<br />
\end{align}<br />
<br />
where <math>I^t</math> is the transformed image. We would like to train a convolutional neural network, <math>\phi(\theta)</math>, that constructs image representations <math>v_{I}=\phi_{\theta}(I)</math>. Pretext-task-based methods learn to predict transformation characteristics, <math>z(t)</math>, by minimizing a transformation-covariant loss function of the form:<br />
<br />
\begin{align} \tag{2} \label{eqn:2}<br />
l_{\text{cov}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,z(t))<br />
\end{align}<br />
<br />
As can be seen, the loss function covaries with the applied transformation and therefore the obtained representations may not be semantically meaningful. PIRL tries to solve this problem as shown in Figure 3. The original and transformed images are passed through two parallel convolutional neural networks to obtain two sets of representations, <math>v(I)</math> and <math>v(I^t)</math>. Then, a contrastive loss function is defined to ensure that the representations of the original and transformed images are similar to each other. The transformation-invariant loss function can be defined as:<br />
<br />
\begin{align} \tag{3} \label{eqn:3}<br />
l_{\text{inv}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,v_{I^t})<br />
\end{align}<br />
<br />
Where L is a contrastive loss based on Noise Contrastive Estimators (NCE). The NCE function can be shown as below: <br />
<br />
\begin{align} \tag{4} \label{eqn:4}<br />
h(v_I,v_{I^t})=\frac{\exp \biggl( \frac{s(v_I,v_{I^t})}{\tau} \biggr)}{\exp \biggl(\frac{s(v_I,v_{I^t})}{\tau} \biggr) + \sum_{I^{'} \in D_N}^{} \exp \biggl( \frac{s(v_{I^t},v_{I^{'}})}{\tau} \biggr)}<br />
\end{align}<br />
<br />
where <math>s(.,.)</math> is the cosine similarity function and <math>\tau</math> is the temperature parameter, usually set to 0.07. Also, a set of N negative images <math>I^{'}\neq I</math> is chosen randomly from the dataset. These images are used in the loss in order to ensure that their representations are dissimilar to the transformed image representations. In addition, in the model implementation, two heads (a few additional layers), <math>f</math> and <math>g</math>, are applied on top of <math>v(I)</math> and <math>v(I^t)</math>. Using the NCE formulation, the contrastive loss can be written as:<br />
<br />
\begin{align} \tag{5} \label{eqn:5}<br />
L_{\text{NCE}}(I,I^{t})=-\text{log}[h(f(v_I),g(v_{I^t}))]-\sum_{I^{'}\in D_N}^{} \text{log}[1-h(g(v_{I^t}),f(v_{I^{'}}))]<br />
\end{align}<br />
<br />
[[File: SSL_4.JPG | 800px | center]]<br />
<div align="center">'''Figure 4:''' Proposed PIRL </div><br />
<br />
Although the formulation looks complicated, the takeaway here is that by minimizing the NCE-based loss function, the similarity between the original and transformed image representations, <math>v(I)</math> and <math>v(I^t)</math>, increases, and at the same time the dissimilarity between <math>v(I^t)</math> and the negative image representations, <math>v(I^{'})</math>, increases. During training, a memory bank, <math>m_I</math>, of dataset image representations is used to access the representations of the dataset images, including the negative images. The proposed PIRL model is shown in Figure (4). Finally, the contrastive loss in equation (5) does not take into account the dissimilarity between the original image representations, <math>v(I)</math>, and the negative image representations, <math>v(I^{'})</math>. By taking this into account and using the memory bank, the final contrastive loss function is obtained as:<br />
<br />
\begin{align} \tag{6} \label{eqn:6}<br />
L(I,I^{t})=\lambda L_{\text{NCE}}(m_I,g(v_{I^t})) + (1-\lambda)L_{\text{NCE}}(m_I,f(v_{I}))<br />
\end{align}<br />
where <math>\lambda</math> is a hyperparameter that determines the weight of each of the NCE losses; its default value is 0.5. In the next section, experimental results are shown using the proposed PIRL model.<br />
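To make Eqs. (4) and (5) concrete, a simplified per-example PyTorch sketch is given below; in practice the loss is computed in batches with memory-bank representations, so the functions and tensor shapes here are illustrative assumptions rather than the authors' implementation.<br />
<pre>
import torch
import torch.nn.functional as F

def scaled_sim(a, b, tau=0.07):
    # s(.,.)/tau from Eq. (4): cosine similarity divided by the temperature.
    return F.cosine_similarity(a, b, dim=-1) / tau

def h(v1, v2, negatives, tau=0.07):
    # Eq. (4): probability that v1 and v2 come from the same image, contrasted
    # against the representations of the N negative images (shape: (N, dim)).
    pos = torch.exp(scaled_sim(v1, v2, tau))
    neg = torch.exp(scaled_sim(v2.unsqueeze(0), negatives, tau)).sum()
    return pos / (pos + neg)

def pirl_nce_loss(f_vI, g_vIt, f_negatives, tau=0.07):
    # Eq. (5) for one (original, transformed) pair:
    #   f_vI        - head output f(v_I) for the original image,
    #   g_vIt       - head output g(v_{I^t}) for the transformed image,
    #   f_negatives - (N, dim) head outputs for the negative images.
    loss = -torch.log(h(f_vI, g_vIt, f_negatives, tau))
    for f_vn in f_negatives:
        loss = loss - torch.log(1 - h(g_vIt, f_vn, f_negatives, tau))
    return loss
</pre>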
<br />
==Experimental Results ==<br />
<br />
For the experiments in this section, PIRL is implemented using jigsaw transformations. The combination of PIRL with other types of transformations is shown in the last section of the summary. The quality of image representations obtained from PIRL Self-Supervised Learning is evaluated by comparing its performance to other Self-Supervised Learning methods on image recognition and object detection tasks. For the experiments, a ResNet50 model is trained using PIRL and other methods by using 1.28M randomly sampled images from ImageNet dataset. Also, the number of negative images used for PIRL is N=32000. <br />
<br />
===Object Detection===<br />
<br />
A Faster R-CNN model with a ResNet-50 backbone, pre-trained using PIRL and other self-supervised methods, is employed for the object detection task. The pre-trained weights are used as the initial weights of the Faster R-CNN backbone during training on the VOC07+12 dataset. The object detection results using PIRL are shown in Figure (5) and compared to other methods. It can be seen that PIRL not only outperforms other self-supervised methods, but '''for the first time it also outperforms supervised pre-training on object detection'''. <br />
<br />
[[File: SSL_5.PNG | 800px | center]]<br />
<div align="center">'''Figure 5:''' Object detection on VOC07+12 using Faster R-CNN and comparing the Average Precision (AP) of detected bounding boxes. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
===Image Classification with linear models===<br />
<br />
In the next experiment, the performance of PIRL is evaluated on image classification using four different datasets. For this experiment, the pre-trained ResNet-50 model is used as a fixed image feature extractor, and a linear classifier is trained on the fixed image representations. The results are shown in Figure (6). They demonstrate that while PIRL substantially outperforms other self-supervised learning methods, it still falls behind supervised pre-training. <br />
<br />
[[File: SSL_6.PNG | 800px | center]]<br />
<div align="center">'''Figure 6:''' Image classification with linear models. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
Overall, the results show that PIRL achieves the best performance among the self-supervised learning methods. It is even able to perform better than the supervised pre-trained model on object detection. This is because PIRL learns representations that are invariant to the applied transformations, which results in more semantically meaningful and richer visual features. In the next section, some analysis of PIRL is presented.<br />
<br />
==Analysis==<br />
<br />
===Does PIRL learn invariant representations?===<br />
<br />
In order to show that the image representations obtained using PIRL are invariant, several images are chosen from the ImageNet dataset and representations of the chosen images and their transformed versions are obtained once using PIRL and once using the jigsaw pretext task, which is the transformation-covariant counterpart of PIRL. Then, for each method, the L2 distance between the original and transformed image representations is computed and the resulting distributions are plotted in Figure (7). It can be seen that PIRL results in more similarity between the original and transformed image representations. Therefore, PIRL learns invariant representations. <br />
<br />
[[File: SSL_7.PNG | 800px | center]]<br />
<div align="center">'''Figure 7:''' Invariance of PIRL representations. </div><br />
<br />
===Which layer produces the best representation?===<br />
Figure 12 studies the quality of representations in earlier layers of the convolutional networks. The figure reveals that the quality of Jigsaw representations improves from the conv1 to the res4 layer but that their quality sharply decreases in the res5 layer. By contrast, PIRL representations are invariant to image transformations and the best image representations are extracted from the res5 layer of PIRL-trained networks.<br />
<br />
[[File: Paper29_SSL.PNG | 400px | center]]<br />
<div align="center">'''Figure 12:'''Quality of PIRL representations per layer. </div><br />
<br />
===What is the effect of <math>\lambda</math> in the PIRL loss function?===<br />
<br />
In order to investigate the effect of <math>\lambda</math> on PIRL representations, the authors obtained the image recognition accuracy on the ImageNet dataset using different values of <math>\lambda</math> in PIRL. As shown in Figure 8, the value of <math>\lambda</math> affects the performance of PIRL, and the optimum value is 0.5. <br />
<br />
[[File: SSL_8.PNG | 800px | center]]<br />
<div align="center">'''Figure 8:''' Effect of varying the parameter <math>\lambda</math> </div><br />
<br />
===What is the effect of the number of image transforms?===<br />
<br />
As another experiment, the authors investigated the effect of the number of image transforms on PIRL's performance. There is a limit on the number of transformations that can be used by the jigsaw pretext method, as that method has to predict the permutation of the patches and the number of parameters in its classification layer grows linearly with the number of transformations used. However, PIRL is able to use all possible patch permutations, of which there are <math>9! \approx 3.6\times 10^5</math>. Figure (9) shows the effect of changing the number of patch permutations on PIRL and jigsaw. The results show that increasing the number of permutations increases the mean Average Precision (mAP) of PIRL on image classification using the VOC07 dataset. <br />
<br />
[[File: SSL_9.PNG | 800px | center]]<br />
<div align="center">'''Figure 9:''' Effect of varying the number of patch permutations </div><br />
<br />
===What is the effect of the number of negative samples?===<br />
<br />
In order to investigate the effect of the number of negative samples, N, on PIRL's performance, the image classification accuracy is obtained on the ImageNet dataset for a variety of values of N. As shown in Figure (10), increasing the number of negative samples results in richer image representations and higher classification accuracy. <br />
<br />
[[File: SSL_10.PNG | 800px | center]]<br />
<div align="center">'''Figure 10:''' Effect of varying the number of negative samples </div><br />
<br />
==Generalizing PIRL to Other Pretext Tasks==<br />
<br />
The PIRL model used in this paper employs jigsaw permutations as the transformation applied to the original image. However, PIRL is generalizable to other pretext tasks. To show this, PIRL is first used with rotation transformations, and the performance of rotation-based PIRL is compared to the covariant rotation pretext task. The results in Figure (11) show that using PIRL substantially increases the classification accuracy on four datasets in comparison with the rotation pretext task. Next, both jigsaw and rotation transformations are used with PIRL to obtain image representations. The results show that combining multiple transformations with PIRL can further improve the accuracy of the image classification task. <br />
<br />
[[File: SSL_11.PNG | 800px | center]]<br />
<div align="center">'''Figure 11:''' Using PIRL with (combinations of) different pretext tasks </div><br />
<br />
==Conclusion==<br />
<br />
In this paper, a new state-of-the-art Self-Supervised learning method, PIRL, was presented. The proposed model learns to obtain features that are common between the original and transformed images, resulting in a set of transformation invariant and more semantically meaningful features. This is done by defining a contrastive loss function between the original images, transformed images, and a set of negative images. The results show that PIRL image representation is richer than previously proposed methods, resulting in higher accuracy and precision on image classification and object detection tasks.<br />
<br />
==Critiques==<br />
<br />
The paper proposes a very nice method for obtaining transformation invariant image representations. However, the authors can extend their work with a richer set of transformations. Also, it would be a good idea to investigate the combination of PIRL with clustering-based methods [7,8]. That may result in better image representations.<br />
<br />
It would also be interesting to visualize the network weights of the deeper layers that extract high-level information and compare them to those of supervised methods.<br />
<br />
== Source Code ==<br />
<br />
https://paperswithcode.com/paper/self-supervised-learning-of-pretext-invariant<br />
<br />
== References ==<br />
<br />
[1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.<br />
<br />
[2] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2015. <br />
<br />
[3] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint, 2017<br />
<br />
[4] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[5] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.<br />
<br />
[6] Jong-Chyi Su, Subhransu Maji, Bharath Hariharan. When does self-supervision improve few-shot learning? European Conference on Computer Vision, 2020.<br />
<br />
[7] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.<br />
<br />
[8] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_Representations&diff=48830Self-Supervised Learning of Pretext-Invariant Representations2020-12-02T05:19:14Z<p>Mrasooli: /* Object Detection */</p>
<hr />
<div>==Authors==<br />
<br />
Ishan Misra, Laurens van der Maaten<br />
<br />
== Presented by == <br />
Sina Farsangi<br />
<br />
== Introduction == <br />
<br />
Modern image recognition and object detection systems find image representations using large amounts of data with pre-defined semantic annotations. Some examples of these annotations are class labels [1] and bounding boxes [2], as shown in Figure 1. Finding representations using pre-defined semantic annotations therefore requires a large amount of labeled data, which is not available in all scenarios. Also, these systems usually learn features specific to a particular set of classes and not necessarily semantically meaningful features that can help generalize to other domains and classes. '''In other words, pre-defined semantic annotations scale poorly to the long tail of visual concepts'''[3]. Therefore, there has been great interest in the community in finding image representations that are more visually meaningful and can help in several tasks such as image recognition and object detection. One of the fast-growing areas of research that tries to address this problem is '''Self-Supervised Learning'''. Self-Supervised Learning tries to learn deep models that find image representations from the pixels themselves rather than from pre-defined semantic annotations. As we will show, in self-supervised learning there is no need for human-provided class labels or bounding boxes for classification and object detection tasks, respectively. <br />
<br />
[[File: SSL_1.JPG | 800px | center]]<br />
<div align="center">'''Figure 1:''' Semantic Annotations used for finding image representations: a) Class labels and b) Bounding Boxes </div><br />
<br />
Self-Supervised Learning is often done using a set of tasks called '''pretext tasks'''. During these tasks, a transformation <math> \tau </math> is applied to unlabeled images <math> I </math> to obtain a set of transformed images, <math> I^{t} </math>. Then, a deep neural network, <math> \phi(\theta) </math>, is trained to predict a characteristic of the transformation. Several pretext tasks exist based on the type of transformation used. Two of the most widely used pretext tasks are rotation and jigsaw puzzles [4,5,6]. As shown in Figure 2, in the rotation task, unlabeled images are rotated by a random angle (0, 90, 180, or 270 degrees) and the deep network learns to predict the rotation. In the jigsaw task, which is more complicated than the rotation prediction task, unlabeled images are cropped into 9 patches and the image is perturbed by randomly permuting the nine patches. Each permutation falls into one of 35 classes according to a formula. A deep network is then trained to predict the class of the permutation of the patches in the perturbed image. Some other tasks include colorization, where the model tries to recover the colors of a colored image turned to greyscale, and image reconstruction, where a square chunk of the image is deleted and the model tries to reconstruct it. <br />
<br />
[[File: SSL_2.JPG |1000px | center]]<br />
<div align="center">'''Figure 2:''' Self-Supervised Learning using Rotation and Jigsaw Pretext Tasks </div><br />
<br />
Although the proposed pretext tasks have achieved promising results, they have the disadvantage of being covariant to the applied transformation. In other words, as deep networks are trained to predict transformation characteristics, they also learn representations that vary with the applied transformation. Intuitively, we would like to obtain representations that are common between the original images and the transformed ones. This idea is supported by the fact that humans are able to recognize these transformed images. This suggests developing a method that obtains image representations that are common between the original and transformed images, in other words, image representations that are transformation invariant. The paper addresses this problem by introducing '''Pretext-Invariant Representation Learning''' (PIRL), which learns self-supervised image representations that, as opposed to those from pretext tasks, are transformation invariant and therefore more semantically meaningful. The performance of the proposed method is evaluated on several self-supervised learning benchmarks. The results show that PIRL sets a new state of the art in self-supervised learning by learning transformation-invariant representations.<br />
<br />
== Problem Formulation and Methodology ==<br />
<br />
[[File: SSL_3.JPG | 800px | center]]<br />
<div align="center">'''Figure 3:''' Overview of Standard Pretext Learning and Pretext-Invariant Representation Learning (PIRL). </div><br />
<br />
<br />
An overview of the proposed method and a comparison with Pretext Tasks are shown in Figure 3. For a given image ,<math>I</math>, in the Dataset of unlabeled images, <math> D=\{{I_1,I_2,...,I_{|D|}}\} </math>, a transformation <math> \tau </math> is applied: <br />
<br />
\begin{align} \tag{1} \label{eqn:1}<br />
I^t=\tau(I)<br />
\end{align}<br />
<br />
Where <math>I^t</math> is the transformed image. We would like to train a convolutional neural network, <math>\phi(\theta)</math>, that constructs image representations <math>v_{I}=\phi_{\theta}(I)</math>. Pretext Task based methods learn to predict transformation characteristics, <math>z(t)</math>, by minimizing a transformation covariant loss function in the form of:<br />
<br />
\begin{align} \tag{2} \label{eqn:2}<br />
l_{\text{cov}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,z(t))<br />
\end{align}<br />
<br />
As it can be seen, the loss function covaries with the applied transformation and therefore, the obtained representations may not be semantically meaningful. PIRL tries to solve for this problem as shown in Figure 3. The original and transformed images are passed through two parallel convolutional neural networks to obtain two set of representations, <math>v(I)</math> and <math>v(I^t)</math>. Then, a contrastive loss function is defined to ensure that the representations of the original and transformed images are similar to each other. The transformation invariant loss function can be defined as:<br />
<br />
\begin{align} \tag{3} \label{eqn:3}<br />
l_{\text{inv}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,v_{I^t})<br />
\end{align}<br />
<br />
Where L is a contrastive loss based on Noise Contrastive Estimators (NCE). The NCE function can be shown as below: <br />
<br />
\begin{align} \tag{4} \label{eqn:4}<br />
h(v_I,v_{I^t})=\frac{\exp \biggl( \frac{s(v_I,v_{I^t})}{\tau} \biggr)}{\exp \biggl(\frac{s(v_I,v_{I^t})}{\tau} \biggr) + \sum_{I^{'} \in D_N}^{} \exp \biggl( \frac{s(v_{I^t},v_{I^{'}})}{\tau} \biggr)}<br />
\end{align}<br />
<br />
where <math>s(.,.)</math> is the cosine similarity function and <math>\tau</math> is the temperature parameter that is usually set to 0.07. Also, a set of N images are chosen randomly from dataset where <math>I^{'}\neq I</math>. These images are used in the loss in order to ensure their representation dissimilarity with transformed image representations. Also, during model implementation, two heads (few additional deep layers) , <math>f</math> and <math>g</math>, are applied on top of <math>v(I)</math> and <math>v(I^t)</math>. Using the NCE formulation, the contrastive loss can be written as:<br />
<br />
\begin{align} \tag{5} \label{eqn:5}<br />
L_{\text{NCE}}(I,I^{t})=-\text{log}[h(f(v_I),g(v_{I^t}))]-\sum_{I^{'}\in D_N}^{} \text{log}[1-h(g(v_{I^t}),f(v_{I^{'}}))]<br />
\end{align}<br />
<br />
[[File: SSL_4.JPG | 800px | center]]<br />
<div align="center">'''Figure 4:''' Proposed PIRL </div><br />
<br />
Although the formulation looks complicated, the takeaway here is that by minimizing the NCE-based loss function, the similarity between the original and transformed image representations, <math>v(I)</math> and <math>v(I^t)</math>, increases, and at the same time the dissimilarity between <math>v(I^t)</math> and the negative image representations, <math>v(I^{'})</math>, increases. During training, a memory bank, <math>m_I</math>, of dataset image representations is used to access the representations of the dataset images, including the negative images. The proposed PIRL model is shown in Figure (4). Finally, the contrastive loss in equation (5) does not take into account the dissimilarity between the original image representations, <math>v(I)</math>, and the negative image representations, <math>v(I^{'})</math>. By taking this into account and using the memory bank, the final contrastive loss function is obtained as:<br />
<br />
\begin{align} \tag{6} \label{eqn:6}<br />
L(I,I^{t})=\lambda L_{\text{NCE}}(m_I,g(v_{I^t})) + (1-\lambda)L_{\text{NCE}}(m_I,f(v_{I}))<br />
\end{align}<br />
Where <math>\lambda</math> is a hyperparameter that determines the weight of each of NCE losses. The default value for this parameter is 0.5. In the next section, experimental results are shown using the proposed PIRL model.<br />
<br />
==Experimental Results ==<br />
<br />
For the experiments in this section, PIRL is implemented using jigsaw transformations. The combination of PIRL with other types of transformations is shown in the last section of the summary. The quality of image representations obtained from PIRL Self-Supervised Learning is evaluated by comparing its performance to other Self-Supervised Learning methods on image recognition and object detection tasks. For the experiments, a ResNet50 model is trained using PIRL and other methods by using 1.28M randomly sampled images from ImageNet dataset. Also, the number of negative images used for PIRL is N=32000. <br />
<br />
===Object Detection===<br />
<br />
A Faster R-CNN model with a ResNet-50 backbone, pre-trained using PIRL and other self-supervised methods, is employed for the object detection task. The pre-trained weights are used as the initial weights of the Faster R-CNN backbone during training on the VOC07+12 dataset. The object detection results using PIRL are shown in Figure (5) and compared to other methods. It can be seen that PIRL not only outperforms other self-supervised methods, but '''for the first time it also outperforms supervised pre-training on object detection'''. <br />
<br />
[[File: SSL_5.PNG | 800px | center]]<br />
<div align="center">'''Figure 5:''' Object detection on VOC07+12 using Faster R-CNN and comparing the Average Precision (AP) of detected bounding boxes. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
===Image Classification with linear models===<br />
<br />
In the next experiment, the performance of the PIRL is evaluated on image classification using four different datasets. For this experiment, the ResNet-50 pretrained model is fixed and used as an image feature extractor. Then, a linear classifier is trained on fixed image representations. The results are shown in Figure (6). The results show that while PIRL substantially outperforms other Self-Supervised Learning methods, it still falls behind Supervised Pretrained Learning. <br />
<br />
[[File: SSL_6.PNG | 800px | center]]<br />
<div align="center">'''Figure 6:''' Image classification with linear models. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
Overall, the results show that PIRL performs best among the self-supervised learning methods. It is even able to perform better than the supervised pre-trained model on object detection. This is because PIRL learns representations that are invariant to the applied transformations, which results in more semantically meaningful and richer visual features. In the next section, some analysis of PIRL is presented.<br />
<br />
==Analysis==<br />
<br />
===Does PIRL learn invariant representations?===<br />
<br />
In order to show that the image representations obtained using PIRL are invariant, several images are chosen from the ImageNet dataset and representations of the chosen images and their transformed version are obtained using one-time PIRL and another time the jigsaw pretext task which is the transformation covariant version of PIRL. Then, for each method, the L2 norm between the original and transformed image representations are computed and their distributions are plotted in Figure (7). It can be seen that PIRL results in more similarity between the original and transformed image representations. Therefore, PIRL learns invariant representations. <br />
<br />
[[File: SSL_7.PNG | 800px | center]]<br />
<div align="center">'''Figure 7:''' Invariance of PIRL representations. </div><br />
<br />
===Which layer produces the best representation?===<br />
Figure 12 studies the quality of representations in earlier layers of the convolutional networks. The figure reveals that the quality of Jigsaw representations improves from the conv1 to the res4 layer but that their quality sharply decreases in the res5 layer. By contrast, PIRL representations are invariant to image transformations and the best image representations are extracted from the res5 layer of PIRL-trained networks.<br />
<br />
[[File: Paper29_SSL.PNG | 400px | center]]<br />
<div align="center">'''Figure 12:'''Quality of PIRL representations per layer. </div><br />
<br />
===What is the effect of <math>\lambda</math> in the PIRL loss function?===<br />
<br />
In order to investigate the effect of <math>\lambda</math> on PIRL representations, the authors measured image recognition accuracy on the ImageNet dataset for different values of <math>\lambda</math>. As shown in Figure 8, the value of <math>\lambda</math> affects the performance of PIRL, and the optimal value is 0.5. <br />
<br />
[[File: SSL_8.PNG | 800px | center]]<br />
<div align="center">'''Figure 8:''' Effect of varying the parameter <math>\lambda</math> </div><br />
<br />
===What is the effect of the number of image transforms?===<br />
<br />
As another experiment, the authors investigated the effect of the number of image transforms on PIRL's performance. The number of transformations that can be used with the jigsaw pretext method is limited, because that method has to predict the permutation of the patches and the number of parameters in its classification layer grows linearly with the number of transformations. PIRL, in contrast, can use all possible patch permutations, of which there are <math>9! \approx 3.6\times 10^5</math>. Figure (9) shows the effect of changing the number of patch permutations on PIRL and jigsaw: increasing the number of permutations increases the mean Average Precision (mAP) of PIRL on image classification using the VOC07 dataset. <br />
<br />
[[File: SSL_9.PNG | 800px | center]]<br />
<div align="center">'''Figure 9:''' Effect of varying the number of patch permutations </div><br />
<br />
===What is the effect of the number of negative samples?===<br />
<br />
In order to investigate the effect of the number of negative samples, N, on PIRL's performance, the image classification accuracy on the ImageNet dataset is measured for a range of values of N. As shown in Figure (10), increasing the number of negative samples results in richer image representations and higher classification accuracy. <br />
<br />
[[File: SSL_10.PNG | 800px | center]]<br />
<div align="center">'''Figure 10:''' Effect of varying the number of negative samples </div><br />
<br />
==Generalizing PIRL to Other Pretext Tasks==<br />
<br />
The PIRL model used in this paper applies jigsaw permutations as the transformation of the original image; however, PIRL generalizes to other pretext tasks. To show this, PIRL is first used with rotation transformations, and the performance of rotation-based PIRL is compared to the covariant rotation pretext task. The results in Figure (11) show that using PIRL substantially increases the classification accuracy on four datasets compared with the rotation pretext task. Next, both jigsaw and rotation transformations are used with PIRL to obtain image representations; the results show that combining multiple transformations with PIRL can further improve the accuracy of the image classification task. <br />
<br />
[[File: SSL_11.PNG | 800px | center]]<br />
<div align="center">'''Figure 11:''' Using PIRL with (combinations of) different pretext tasks </div><br />
<br />
==Conclusion==<br />
<br />
In this paper, a new state-of-the-art Self-Supervised learning method, PIRL, was presented. The proposed model learns features that are common between the original and transformed images, resulting in a set of transformation-invariant and more semantically meaningful features. This is done by defining a contrastive loss function between the original images, transformed images, and a set of negative images. The results show that PIRL's image representations are richer than those of previously proposed methods, resulting in higher accuracy and precision on image classification and object detection tasks.<br />
<br />
==Critiques==<br />
<br />
The paper proposes a very nice method for obtaining transformation-invariant image representations. However, the authors could extend their work with a richer set of transformations. It would also be worthwhile to investigate the combination of PIRL with clustering-based methods [7,8], which may result in better image representations.<br />
<br />
It would also be informative to visualize the network weights in the deeper layers, which extract high-level information, and compare them with those of supervised methods.<br />
<br />
== Source Code ==<br />
<br />
https://paperswithcode.com/paper/self-supervised-learning-of-pretext-invariant<br />
<br />
== References ==<br />
<br />
[1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.<br />
<br />
[2] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2015. <br />
<br />
[3] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint, 2017<br />
<br />
[4] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[5] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.<br />
<br />
[6] Jong-Chyi Su, Subhransu Maji, Bharath Hariharan. When does self-supervision improve few-shot learning? European Conference on Computer Vision, 2020.<br />
<br />
[7] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.<br />
<br />
[8] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=48817Roberta2020-12-02T04:07:51Z<p>Mrasooli: /* Conclusion */</p>
<hr />
<div>= RoBERTa: A Robustly Optimized BERT Pretraining Approach =<br />
== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP (Natural Language Processing) domain, such as ELMo [1], GPT [2], BERT [3], XLM [4], and XLNet [5], have shown significant improvements, but it remains challenging to determine which parts of these methods contribute the most. This paper proposes RoBERTa, which replicates BERT pretraining and investigates the effects of hyperparameter tuning and training set size. In summary, the main contributions of this paper are (1) modifying some BERT design choices and training schemes and (2) a new set of datasets; both categories of modification improve performance on downstream tasks.<br />
<br />
== Background ==<br />
This section gives an overview of BERT, since RoBERTa builds on this architecture. In short, BERT uses the transformer architecture [6] with two training objectives: masked language modeling (MLM) and next sentence prediction (NSP). The MLM objective randomly selects some of the tokens in the input sequence and replaces them with the special token [MASK]; the model then tries to predict these tokens from the surrounding context. NSP employs a binary classification loss to predict whether two sentences are adjacent to each other or not. While selecting positive next sentences is trivial, generating negative ones is often much more difficult; originally, a randomly chosen sentence was used as the negative. BERT is trained with the Adam optimizer using specific hyperparameters and on several different datasets. Finally, its performance is reported on downstream evaluation tasks such as GLUE [7], SQuAD, and RACE.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, the authors compare static and dynamic masking. As mentioned in the previous section, the masked language modeling objective in BERT pre-training masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once during preprocessing, which implies that the same masking pattern is used for the same sequence in all training steps. To extend this single static mask, the authors duplicated the training data 10 times so that each sequence was masked in 10 different ways. The model was trained on these data for 40 epochs, so each particular mask was seen only 4 times.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
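<br />
The sketch below illustrates the difference: with dynamic masking, a fresh mask is sampled every time a batch is fed to the model rather than being fixed during preprocessing. It uses the usual 15% masking rate but omits BERT's 80/10/10 replace/random/keep split; the helper name and tensor layout are assumptions for illustration.<br />
<pre>
import torch

def dynamic_mask(token_ids, mask_token_id, mask_prob=0.15):
    # Sample a fresh masking pattern every time a batch is fed to the model.
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    labels[~mask] = -100                     # -100 is ignored by nn.CrossEntropyLoss by default
    masked_inputs = token_ids.clone()
    masked_inputs[mask] = mask_token_id      # simplified: BERT also keeps or randomizes some tokens
    return masked_inputs, labels
</pre>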
<br />
=== Input Representation and Next Sentence Prediction ===<br />
Next, they investigated the necessity of the next sentence prediction objective by comparing several input formats, with and without the NSP loss, during pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences), which come either from the same document or, with probability 0.5, from two different documents, and the model is trained with the NSP objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the original BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: This is the same as the segment-pair representation but with pairs of sentences instead of segments. However, the total length of sequences here would be a lot less than 512. Hence, a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, no NSP loss is used. Input sequences consist of full sentences sampled contiguously from one or more documents, packed until the total length is at most 512 tokens; when one document ends, sentences from the next document are appended after an extra separator token. A sketch of this packing is given after the table below.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: As in the Full-Sentences setting, no NSP loss is used. This is the same as Full-Sentences, except that sequences do not cross document boundaries, i.e. once a document ends, sentences from the next one are not added to the sequence. Since inputs sampled near the end of a document can therefore be shorter than 512 tokens, the batch size is adjusted dynamically so that the total number of tokens per step stays similar to Full-Sentences.<br />
<br />
The following table shows each setting's performance on the downstream tasks; the best results are achieved by the DOC-SENTENCES setting, which removes the NSP loss. However, the authors chose to use FULL-SENTENCES for convenience, since DOC-SENTENCES results in variable batch sizes.<br />
[[File:NSP_loss.JPG|600px|center]]<br />
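<br />
A rough sketch of the FULL-SENTENCES input construction referred to above is given here: sentences are appended until the 512-token budget is reached, and an extra separator token is inserted when a document boundary is crossed. The function and argument names are illustrative and not taken from the paper's implementation.<br />
<pre>
def pack_full_sentences(documents, sep_id, max_len=512):
    # documents: list of documents, each given as a list of tokenized sentences (lists of ids).
    sequences, current = [], []
    for doc in documents:
        for sent in doc:
            if current and len(current) + len(sent) > max_len:
                sequences.append(current)    # the current sequence is full; start a new one
                current = []
            current = current + sent
        if current and len(current) < max_len:
            current.append(sep_id)           # extra separator when crossing a document boundary
    if current:
        sequences.append(current)
    return sequences
</pre>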
<br />
=== Large Batch Sizes ===<br />
Next, the authors investigated the importance of large batch sizes. They tried several batch sizes and found that a batch size of 2K sequences performed best, as shown in the table below, which suggests that the original BERT batch size was too small. The authors used an 8K batch size in the remainder of their experiments; one common way to approximate such large batches in practice is sketched after the table. <br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
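<br />
Batches of 2K-8K sequences rarely fit in device memory at once; one common way to approximate them (not necessarily what the authors did, since they trained on many accelerators in parallel) is gradient accumulation, sketched below under the assumption that <code>model(inputs, labels)</code> returns a scalar loss.<br />
<pre>
def train_with_accumulation(model, optimizer, loader, accumulation_steps=32):
    # Approximate a large batch (e.g. 8K sequences) by accumulating gradients over
    # several smaller micro-batches before each optimizer step.
    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(loader):
        loss = model(inputs, labels) / accumulation_steps   # scale so gradients average correctly
        loss.backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
</pre>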
<br />
=== Tokenization ===<br />
RoBERTa uses byte-level Byte-Pair Encoding (BPE) for tokenization, whereas BERT uses character-level BPE. BPE is a hybrid between character- and word-level modeling based on sub-word units. The authors also use a vocabulary of 50K units rather than the 30K of the original BERT implementation, which increases the total number of parameters by approximately 15M and 20M for BERT base and BERT large, respectively. This change actually results in a slight degradation of end-task performance in some cases; however, the authors preferred the ability to universally encode text without introducing "unknown" tokens.<br />
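<br />
For a quick look at byte-level BPE of this kind, the Hugging Face tokenizer distributed with RoBERTa can be used; note that this is a later re-implementation rather than the tokenizer code used in the paper.<br />
<pre>
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
print(tokenizer.vocab_size)                               # roughly 50K byte-level BPE units
print(tokenizer.tokenize("RoBERTa uses byte-level BPE."))
print(tokenizer.encode("RoBERTa uses byte-level BPE."))   # token ids, with special tokens added
</pre>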
<br />
== RoBERTa ==<br />
The authors claim that applying all of the aforementioned modifications to BERT and pre-training the model on a larger dataset yields higher performance on downstream tasks. They used several datasets for pre-training, listed below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
These data are used in conjunction with the modifications above: dynamic masking, Full-Sentences without an NSP loss, large mini-batches, and a larger byte-level BPE vocabulary.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
<div align="center">'''Table 5:''' RoBERTa's developement set results during pretraining over more data (160GB from 16GB) and longer duration (100K to 300K to 500K steps). Results are accumulated from each row. </div><br />
<br />
RoBERTa outperforms state-of-the-art algorithms, including ensemble models, on almost all GLUE tasks. In addition, the performance of RoBERTa is compared with other methods on the RACE and SQuAD evaluations, with the results shown in the tables below.<br />
[[File:squad.JPG|400px|center]]<br />
[[File:race.JPG|400px|center]]<br />
<br />
Table 5 presents the results of these experiments. RoBERTa offers major improvements over BERT (Large). Three additional datasets are added to the original dataset with the original number of steps (100K); in total, 160GB is used for pretraining. Finally, RoBERTa is pretrained for far more steps, increased from 100K to 300K and then to 500K, and improvements are observed across all downstream tasks. With 500K steps, XLNet (large) is also outperformed on most tasks.<br />
<br />
== Conclusion ==<br />
The results confirm that employing large batches over more data, along with longer training time, improves performance. In essence, the authors argue that the sources of the reported gains of newer models may be questionable: if BERT is pre-trained carefully enough, it can achieve the same performance as RoBERTa.<br />
<br />
== The comparison at a glance ==<br />
<br />
[[File:comparison_roberta.png|500px|center|thumb|from [8]]]<br />
<br />
== Critique ==<br />
While the results are outstanding and appreciable (largely due to using more data and computational resources), the technical novelty of the paper is only marginally incremental, as the architecture is largely unchanged from BERT.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/pytorch/fairseq/tree/master/examples/roberta RoBERTa]. The original repository for RoBERTa is in PyTorch; TensorFlow users may be interested in the [https://github.com/Yaozeng/roberta-tf PyTorch to TensorFlow] port as well.<br />
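<br />
For quick experimentation, the fairseq repository documents loading the released model through <code>torch.hub</code>; a typical usage, paraphrased from that documentation (check the current README for the exact interface), looks like the following.<br />
<pre>
import torch

# Download and load the pretrained RoBERTa base model via PyTorch Hub (downloads on first use).
roberta = torch.hub.load("pytorch/fairseq", "roberta.base")
roberta.eval()                                    # disable dropout for deterministic features

tokens = roberta.encode("Hello world!")           # byte-level BPE encoding to a tensor of ids
features = roberta.extract_features(tokens)       # last-layer hidden states
print(features.shape)
</pre>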
<br />
== References == <br />
[1] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Association for Computational Linguistics (NAACL).<br />
<br />
[2] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.<br />
<br />
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).<br />
<br />
[4] Guillaume Lample and Alexis Conneau. 2019. Cross lingual language model pretraining. arXiv preprint arXiv:1901.07291.<br />
<br />
[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.<br />
<br />
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.<br />
<br />
[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).<br />
<br />
[8] Suleiman Khan. BERT, RoBERTa, DistilBERT, XLNet - which one to use?<br />
[https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8/ link]</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adacompress:_Adaptive_compression_for_online_computer_vision_services&diff=48811Adacompress: Adaptive compression for online computer vision services2020-12-02T03:56:28Z<p>Mrasooli: /* Problem Formulation */</p>
<hr />
<div><br />
== Presented by == <br />
Ahmed Hussein Salamah<br />
<br />
== Introduction == <br />
<br />
Big data and deep learning have been merged to create the great success of artificial intelligence, which increases the burden on network speed, computational complexity, and storage in many applications. Deep neural networks have achieved state-of-the-art performance in image classification, one of the main tasks in the computer vision domain. Recently, providers tend to host image classification models on the cloud so that computational power can be shared among different users, as mentioned in this paper (e.g., SenseTime, Baidu Vision, Google Vision). Most researchers in the literature work to improve the structure and increase the depth of DNNs to achieve better performance in how features are represented and crafted using Convolutional Neural Networks (CNNs). Most well-known image classification datasets (e.g. ImageNet) are compressed using JPEG, a commonly used compression technique. JPEG is optimized for the Human Visual System (HVS) but not for machines (i.e. DNNs). To better align compression with the machine's perception rather than the HVS, the authors reconfigure the JPEG pipeline while maintaining the same classification accuracy. <br />
<br />
'''Why is image compression important?'''<br />
<br />
Image compression is crucial in deep learning because we want the image data to take up less disk space and load faster. Compared to lossless compression such as PNG, which preserves the original image data, JPEG is a lossy form of compression, meaning some information is lost in exchange for an improved compression ratio. Therefore, it is important to develop deep-learning-based image compression methods that reduce data size without jeopardizing classification accuracy. Examples of this type of image compression include the LSTM-based approach proposed by Google [9], the transformation-based method from New York University [10], and the autoencoder-based approach by Twitter [11].<br />
<br />
== Methodology ==<br />
<br />
[[File: ada-fig2.PNG | 400px | center]]<br />
<div align="center">'''Figure 1:''' Comparing to the conventional solution, the authors [1] solution can update the compression strategy based on the backend model feedback </div><br />
<br />
One of the major parameters that can be changed in the JPEG pipeline is the quantization table, which is the main source of artifacts introduced by the lossy compression, as shown in [1, 4]. The authors are motivated to change the JPEG configuration to optimize the uploading rate for different cloud computer vision services without any pre-knowledge of the original model or dataset. This contrasts with [2, 3, 5], where the JPEG configuration is adjusted by retraining the parameters or according to the structure of the model. Lowering the quality level decreases the image size and quality, but the deep learning model can often still recognize the image, as shown in [4]. The authors in [1] use Deep Reinforcement Learning (DRL) in an online manner to choose the quantization level for an image uploaded to the cloud computer vision model, and this is the only approach that designs an adaptive JPEG based on an ''RL mechanism''.<br />
<br />
The approach is designed around an interactive training environment that represents any computer vision cloud service. A deep Q neural network agent is used to evaluate and predict the performance of a quantization level on an uploaded image. The agent is driven by a reward function that considers two optimization objectives: accuracy and image size. Training proceeds iteratively through interaction with the environment. The environment is exposed to different images containing different amounts of redundant information, so an adaptive solution is needed to select a suitable compression level for each image. Thus, the authors design an explore-exploit mechanism to train the agent across different sceneries, implemented in the deep Q agent as an inference-estimate-retrain mechanism that restarts the training procedure when needed. The authors verify their approach by providing analysis and insight using Grad-CAM [8], showing patterns of how a compression level is chosen for each image and its corresponding quality factor. Each image elicits a different response from the deep learning model. In general, images are more sensitive to compression if they have large smooth areas, while those with complex textures are more robust to compression.<br />
<br />
'''What is a quantization table?'''<br />
<br />
Before getting to the quantization table, first consider the basic architecture of JPEG's baseline system. It has four blocks: the FDCT (Forward Discrete Cosine Transform), the quantizer, the statistical model, and the entropy encoder. The FDCT block takes an input image separated into <math> n \times n </math> blocks and applies a discrete cosine transform, producing DCT coefficients. These coefficients take values from a relatively large discrete set and are then mapped, through quantization, to a smaller discrete set. This is accomplished with a quantization table in the quantizer block, which is designed to preserve low-frequency information at the cost of high-frequency information. Low frequencies are favoured because losing high-frequency information is less noticeable to the human visual system.<br />
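To make the quantizer concrete, the sketch below scales a quantization table by a JPEG quality factor and quantizes the DCT coefficients of one 8x8 block. The table is the standard luminance table from the JPEG specification and the quality scaling follows the common IJG convention, so this should be read as an illustration rather than the exact pipeline of any particular encoder.<br />
<pre>
import numpy as np
from scipy.fftpack import dctn

# Standard JPEG luminance quantization table (Annex K of the JPEG spec)
Q_LUMA = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99]], dtype=np.float64)

def scaled_table(quality):
    # Scale the base table by a quality factor in [1, 100] (IJG convention)
    scale = 5000.0 / quality if quality < 50 else 200.0 - 2.0 * quality
    return np.clip(np.floor((Q_LUMA * scale + 50.0) / 100.0), 1, 255)

def quantize_block(block, quality):
    # DCT an 8x8 pixel block (level-shifted to [-128, 127]) and quantize it
    coeffs = dctn(block - 128.0, norm='ortho')
    return np.round(coeffs / scaled_table(quality))

block = np.random.randint(0, 256, size=(8, 8)).astype(np.float64)
# Lower quality -> larger divisors -> more coefficients rounded to zero
print(np.count_nonzero(quantize_block(block, 95)), "nonzero coefficients at quality 95")
print(np.count_nonzero(quantize_block(block, 20)), "nonzero coefficients at quality 20")
</pre>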
<br />
== Problem Formulation ==<br />
<br />
The authors formulate the problem by modelling the cloud deep learning service as <math> \vec{y}_i = M(x_i)</math>, which predicts a results list <math> \vec{y}_i </math> for an input image <math> x_i </math>; for a reference input <math> x_{\rm ref} \in X_{\rm ref} </math> the output is <math> \vec{y}_{\rm ref} = M(x_{\rm ref}) </math>. Here <math> \vec{y}_{\rm ref} </math> is referred to as the ground truth label, and <math> \vec{y}_c = M(x_c) </math> is the prediction for the compressed image <math> x_{c} </math> with quality factor <math> c </math>.<br />
<br />
<br />
\begin{align} \tag{1} \label{eq:accuracy}<br />
\mathcal{A} =& \sum_{k}\min_jd(l_j, g_k) \\ <br />
& l_j \in \vec{y}_c, \quad j=1,...,5 \nonumber \\<br />
& g_k \in \vec{y}_{\rm ref}, \quad k=1, ..., {\rm length}(\vec{y}_{\rm ref}) \nonumber \\<br />
& d(x, y) = 1 \ \text{if} \ x=y \ \text{else} \ 0 \nonumber<br />
\end{align}<br />
<br />
The authors divided the datasets according to their contextual group <math> X </math> following [6], and they compare their results using the compression ratio <math> \Delta s = \frac{s_c}{s_{\rm ref}} </math>, where <math>s_{c}</math> is the compressed size and <math>s_{\rm ref}</math> is the original size, and the accuracy metric <math> \mathcal{A}_c </math>, which is calculated from the hamming distance between the Top-5 softmax outputs of the original and compressed images, as shown in Eq. \eqref{eq:accuracy}. In the RL design stage, continuous numerical vectors serve as the input features to the DRL agent, which is a Deep Q Network (DQN). The challenges of this approach are: <br />
(1) The state space of RL is too large to cover, so the neural network is typically constructed with more convolutional and fully-connected layers, which makes the DRL agent hard to converge and the training time-consuming; <br />
(2) The DRL always starts with a random initial state, but it needs to find a higher reward before starting the training of the DQN. However, the sparse reward feedback resulting from a random initialization makes learning difficult.<br />
The authors address these problems by using a pre-trained compact model, MobileNetV2, as a feature extractor <math> \mathcal{E} </math>, chosen for being lightweight and effective at image classification; it is kept fixed while training the Q network <math> \phi </math>. The last convolution layer of <math> \mathcal{E} </math> is connected as the input to the Q network <math>\phi </math>, so by optimizing the parameters of the Q network <math> \phi </math>, the RL agent's policy is updated.<br />
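A minimal sketch of how the two evaluation quantities could be computed for one image is shown below, assuming Pillow is used for the JPEG re-encoding and the cloud service returns a ranked label list; the function M referenced in the usage comment is a hypothetical wrapper around such a service.<br />
<pre>
import io
from PIL import Image

def top5_overlap(labels_compressed, labels_ref):
    # Eq. (1): how many of the compressed image's Top-5 labels appear in the reference labels
    return sum(1 for l in labels_compressed[:5] if l in labels_ref)

def compress(image_path, quality):
    # Returns (Delta_s = s_c / s_ref, the re-encoded JPEG bytes)
    s_ref = len(open(image_path, 'rb').read())
    buf = io.BytesIO()
    Image.open(image_path).convert('RGB').save(buf, format='JPEG', quality=quality)
    return buf.tell() / s_ref, buf.getvalue()

# Usage (M is a placeholder for the cloud vision API):
# delta_s, jpeg_bytes = compress('example.jpg', quality=25)
# A = top5_overlap(M(jpeg_bytes), M(open('example.jpg', 'rb').read()))
</pre>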
<br />
==Reinforcement learning framework==<br />
<br />
This paper [1] describes the reinforcement learning problem with the ''emulator environment'' <math> \{\mathcal{X}, M\} </math>, where <math> \mathcal{X} </math> defines the contextual information of the inputs <math> x </math> coming from the user and <math> M </math> is the backend cloud model. Each RL frame is defined by an ''action and a state'': the action is one of 10 discrete quality levels ranging from 5 to 95 with a step size of 10, and the state is the feature extractor's output <math> \mathcal{E}(J(\mathcal{X}, c)) </math>, where <math> J(\cdot) </math> is the JPEG output at a specific quantization level <math> c </math>. The optimal quantization level at time <math> t </math> is <math> c_t = {\rm argmax}_cQ(\phi(\mathcal{E}(f_t)), c; \theta) </math>, where <math> Q(\phi(\mathcal{E}(f_t)), c; \theta) </math> is the action-value function and <math> \theta </math> denotes the parameters of the Q network <math> \phi </math>. In the training stage, the goal is to minimize the loss function <math> L_i(\theta_i) = \mathbb{E}_{s, c \sim \rho (\cdot)}\Big[\big(y_i - Q(s, c; \theta_i)\big)^2 \Big] </math>, which changes at each iteration <math> i </math>, where <math> s = \mathcal{E}(f_t) </math> and <math>f_t</math> is the output of the JPEG encoder, <math> y_i = \mathbb{E}_{s' \sim \{\mathcal{X}, M\}} \big[ r + \gamma \max_{c'} Q(s', c'; \theta_{i-1}) \mid s, c \big] </math> is the target, <math> \rho(s, c) </math> is a probability distribution over states <math> s </math> and quality levels <math> c </math> at iteration <math> i </math>, and <math> r </math> is the feedback reward. <br />
<br />
The framework obtains a more accurate estimate for a selected action as the distance between the target and the output of the action-value function <math> Q(\cdot)</math> is minimized. Since no feedback signal indicates that an episode has finished, a condition <math> t \geq T_{\rm start} </math> is used to guarantee that enough transitions are stored in the memory buffer <math> \mathcal{D} </math> to train on. To create these transitions for the RL agent, random trials are first collected to observe the environment's reaction. After fetching some trials from the environment with their corresponding rewards, this randomness is decreased as the agent is trained to minimize the loss function <math> L </math>, as shown in the Algorithm below. The agent optimizes its actions on minibatches drawn from <math> \mathcal{D} </math>, so the compression level predictor <math> \phi </math> is trained on historical optimal experience. When the trained predictor <math> \phi </math> is deployed, the RL agent drives the compression engine with an adaptive quality factor <math> c </math> for each input image <math> x_{i} </math>. <br />
<br />
The interaction between the agent and the environment <math> \{\mathcal{X}, M\} </math> is evaluated using the reward function, which is formulated so that selecting an appropriate quality factor <math> c </math> is rewarded in proportion to the accuracy metric <math> \mathcal{A}_c </math> and penalized in proportion to the compression rate <math> \Delta s = \frac{s_c}{s_{\rm ref}} </math>. The reward function is given by <math> R(\Delta s, \mathcal{A}) = \alpha \mathcal{A} - \Delta s + \beta</math>, where <math> \alpha </math> and <math> \beta </math> are constants forming a linear combination.<br />
<br />
[[File:Alg2.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Algroithim :''' Training RL agent <math> \phi </math> in environment <math> \{\mathcal{X}, M\} </math> </div><br />
<br />
== Inference-Estimate-Retrain Mechanism ==<br />
The system diagram of AdaCompress is shown in the figure below, in contrast with the existing modules. When AdaCompress is deployed, the scenery context <math> \mathcal{X} </math> of the input images may change; in this case the RL agent's compression selection strategy may cause the overall accuracy to decrease. To solve this issue, the estimator is invoked with probability <math>p_{\rm est} </math>: a random value <math> \xi \in (0,1) </math> is generated, and the estimator is invoked if <math>\xi \leq p_{\rm est}</math>. AdaCompress then uploads both the original image and the compressed image to fetch their labels. The accuracy is calculated, and the transition, which now also includes the accuracy, is stored in the memory buffer. Comparing the average accuracy of the most recent n steps with the earliest average accuracy, the estimator invokes the RL training kernel to retrain the agent if the recent average accuracy is much lower than the initial one.<br />
<br />
[[File: diagfig.png|500px|center]]<br />
<br />
To handle changes in the scenery at the inference phase that might cause accuracy to degrade, the authors introduce the '''inference-estimate-retrain mechanism'''. The estimator is invoked with an adaptively changing probability <math> p_{\rm est} </math>, which is compared against a generated random value <math> \xi \in (0,1) </math>. As shown in Figure 2, AdaCompress switches between three states adaptively, as described in the following sections.<br />
<br />
[[File:fig3.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Figure 2:''' State Switching Policy </div><br />
<br />
=== Inference State ===<br />
The '''inference state''' runs most of the time: the deployed RL agent, already trained, predicts the compression level <math> c </math> for each image uploaded to the cloud, minimizing the uploading traffic load. The agent occasionally switches to the estimator stage with probability <math> p_{\rm est} </math>, which keeps it robust to changes in the scenery and maintains a stable accuracy. <math> p_{\rm est} </math> is fixed during the inference stage but changes adaptively, as a function of the accuracy gradient, in the next stage. In the '''estimator state''', there is a trade-off between the objective of reducing upload traffic and the risk of a changed scenery; an accuracy-aware dynamic <math> p'_{\rm est} </math> is therefore designed, based on the average accuracy <math> \bar{\mathcal{A}}_n </math> over the most recent <math> n </math> of <math> N </math> steps, computed according to Eq. \ref{eqn:accuracy_n}.<br />
\begin{align} \tag{2} \label{eqn:accuracy_n}<br />
\bar{\mathcal{A}_n} &=<br />
\begin{cases}<br />
\frac{1}{n}\sum_{i=N-n}^{N} \mathcal{A}_i & \text{ if } N \geq n \\ <br />
\frac{1}{n}\sum_{i=1}^{n} \mathcal{A}_i & \text{ if } N < n <br />
\end{cases}<br />
\end{align}<br />
===Estimator State===<br />
The '''estimator state''' is executed when <math> \xi \leq p_{\rm est} </math> is satisfied. Here the upload traffic increases, since both the reference image <math> x_{\rm ref} </math> and the compressed image <math> x_{i} </math> are uploaded to the cloud in order to calculate <math> \mathcal{A}_i </math> from <math> \vec{y}_{\rm ref} </math> and <math> \vec{y}_i </math>. The result is stored in the memory buffer <math> \mathcal{D} </math> as a transition <math> (\phi_i, c_i, r_i, \mathcal{A}_i) </math> of trial <math>i</math>. The current policy is no longer suitable when the average accuracy <math> \bar{\mathcal{A}}_n </math> over the latest <math>n</math> steps is lower than the average <math> \mathcal{A}_0 </math> over the earliest <math>n</math> steps in the memory buffer <math> \mathcal{D} </math>. Consequently, <math> p_{\rm est} </math> should be increased so that the estimate stage happens more frequently. It should therefore be a function of the gradient of the average accuracy <math> \bar{\mathcal{A}}_n </math>, so that the buffer memory <math> \mathcal{D} </math> is filled with enough transitions to retrain the agent when the average accuracy <math> \bar{\mathcal{A}}_n </math> drops. The authors formulate this as <math> p'_{\rm est} = p_{\rm est} + \omega \nabla \bar{\mathcal{A}} </math>, where <math> \omega </math> is a scaling factor. Starting from an initial probability <math> p_0 </math>, the general form is <math>p_{\rm est} = p_0 + \omega \sum_{i=0}^{N} \nabla \bar{\mathcal{A}_i} </math>. <br />
<br />
===Retrain State===<br />
In the '''retrain state''', the RL agent is retrained on the transitions stored in the memory buffer <math> \mathcal{D} </math> to adapt to the change in the input scenery. The retrain stage finishes when the average reward <math> \bar{r}_n </math> over the recent <math> n </math> steps is higher than a threshold <math> r_{th}</math> defined by the user. Afterwards, a new retraining stage can be prepared by saving new transitions after flushing the old buffer memory <math> \mathcal{D}</math>. A sketch of this switching logic is given below. The authors support the agent's compression choices for different cloud application environments by providing insights from a visualization algorithm [8] applied to images and their corresponding quality factors <math> c </math>. The visualizations show that the agent chooses a quantization level <math> c </math> based on the visual textures in different regions of the image. For instance, a relatively low quality factor can be tolerated for rough, textured regions, while smooth regions require a relatively higher quality, consistent with the earlier observation that images with large smooth areas are more sensitive to compression.<br />
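The sketch below is a compact, assumption-laden rendering of the state-switching logic described above: the running averages of Eq. \eqref{eqn:accuracy_n} are kept in a fixed-length window, the probability update follows <math>p'_{\rm est} = p_{\rm est} + \omega \nabla \bar{\mathcal{A}}</math>, and the initial probability, scaling factor, window length, and retrain tolerance are illustrative values only.<br />
<pre>
import random
from collections import deque

class EstimateScheduler:
    def __init__(self, p0=0.2, omega=0.5, window=50, drop_tol=0.05):
        self.p_est = p0            # current estimation probability
        self.omega = omega         # scaling factor for the accuracy gradient
        self.acc = deque(maxlen=window)
        self.drop_tol = drop_tol   # tolerated drop of recent vs. initial accuracy

    def should_estimate(self):
        # Enter the estimator state when xi <= p_est
        return random.random() <= self.p_est

    def record_accuracy(self, a):
        # Called in the estimator state, where the reference label is available
        prev = sum(self.acc) / len(self.acc) if self.acc else a
        self.acc.append(a)
        recent = sum(self.acc) / len(self.acc)
        grad = prev - recent                     # positive when accuracy is falling
        self.p_est = min(1.0, max(0.0, self.p_est + self.omega * grad))

    def should_retrain(self, initial_avg):
        # Retrain when the recent average accuracy drops well below the initial one
        if not self.acc:
            return False
        return sum(self.acc) / len(self.acc) < initial_avg - self.drop_tol
</pre>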
<br />
<br />
<br />
==Insight of RL agent’s behavior==<br />
In the inference state, the RL agent predicts a proper compression level based on the features of the input image. In the next subsection, we will see that this compression level varies for different image sets and backend cloud services. Also, by taking a look at the attention maps for some of the images, we will figure out why the agent has chosen this compression level.<br />
===Compression level choice variation===<br />
In Figure 5, for Face++ and Amazon Rekognition, the agent’s choices are mostly around compression level = 15, but for Baidu Vision, the agent’s choices are distributed more evenly. Therefore, the backend strategy really affects the choice for the optimal compression level.<br />
<br />
[[File:comp-level1.PNG|500px|center|fig: running-retrain]]<br />
Figure 6 shows how the agent's behaviour in selecting the optimal compression level changes for different datasets. The two datasets, ImageNet and DNIM, present different contextual sceneries: the randomly selected ImageNet images were mostly taken in daytime, while the DNIM images were mostly taken at night. The figure shows that for DNIM images the agent's choices are mostly concentrated at relatively high compression levels, whereas for the ImageNet dataset the choices are distributed more evenly. <br />
<br />
[[File:comp-level2.PNG|500px|center|fig: running-retrain]]<br />
<br />
== Results ==<br />
Figure 4 reports results for three different cloud services compared to the benchmark images. AdaCompress reduces the upload size by more than half while roughly preserving the top-5 accuracy computed with <math> \mathcal{A} </math>, with an average drop of about 7%, demonstrating the efficiency of the design. Figure 5 illustrates the ''' inference-estimate-retrain ''' mechanism: the x-axis indicates steps, and the <math> \Delta </math> mark on the <math>x</math>-axis indicates a change in the scenery. The estimation probability <math> p_{\rm est} </math> and the accuracy move in opposite directions: as the accuracy drops below its initial value, <math> p_{\rm est} </math> increases adaptively, since the accuracy metric <math> \mathcal{A}_c </math> of each action <math> c </math> lowers the average accuracy used in the next estimations. At the red vertical line, the scenery starts to change and the <math>Q</math> network starts to retrain to adapt the agent to the current scenery. During the retrain stage, the output returned to the user is always the reference image's prediction label <math> \vec{y}_{\rm ref} </math>. <br />
The authors also plot the scaled uploading data size of the proposed algorithm against the overhead data size of the benchmark. After the average accuracy becomes stable and high, the transmission load is reduced by decreasing the <math> p_{\rm est} </math> value. During the retrain stage, <math> p_{\rm est} </math> and <math> \mathcal{A} </math> are both effectively equal to 1, so the uploaded data exceeds the conventional benchmark; in the inference stage, the uploaded size is roughly halved, as shown in Figures 3 and 4.<br />
[[File:upload overhead.png|500px|center|fig: running-retrain]]<br />
<div align="center">'''Figure 3:''' Difference in overhead of size during training and inference phase </div><br />
<br />
[[File:ada-fig9.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Figure 4:''' Different cloud services compared relative to average size and accuracy </div><br />
<br />
<br />
[[File:ada-fig10.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Figure 5:''' Scenery change response from AdaCompress Algorithm </div><br />
<br />
==Conclusion==<br />
<br />
Most of the research focused on modifying the deep learning model instead of dealing with the currently available approaches. The authors succeed in defining the compression level for each uploaded image to decrease the size and maintain the top-5 accuracy in a robust manner even the scenery is changed. <br />
In my opinion, Eq. \eqref{eq:accuracy} is not well defined, as I found it does not really affect the reward function. Also, they did not use the whole training set from ImageNet, which raises the question of what the largest file size in the subset they considered was. In addition, if they had used the whole dataset, should we expect the same performance from the mechanism, or better? I believe it would be better in both accuracy and compression.<br />
<br />
== Critiques == <br />
<br />
The authors used a pre-trained model as a feature extractor to select a Quality Factor (QF) for JPEG. What seems to be missing is a report of the distribution of the selected QFs over their span, as it is important to understand which levels are expected to contribute most across the datasets used. The authors also did not run their approach on a complete database like ImageNet; they only included parts of two different datasets. They may have been limited by the datasets available for testing, such as the CIFARs, which are not really comparable in resolution, since real online computer vision services work with higher-resolution images. <br />
In the next section, I run one experiment using Inception-V3 to see whether it is possible to get better accuracy. I found that using the Inception model as the pre-trained model makes it possible to choose a lower QF; however, as is well known, mobile models are shallower than Inception models, which makes them less complex to run on edge devices. I think it is possible to achieve at least the same accuracy, or even better, if the mobile model is replaced with Inception, as shown in the following section.<br />
<br />
=== Extra Analysis ===<br />
In the following figure, I took a single image from ImageNet with the ground truth ''' Sea Snake ''' and encoded it with a QF of 20. I ran inference with the Inception V3 model benchmarked by TensorFlow. The Human Visual System (HVS) cannot recognize the compressed image, so we would expect the trained model to miss the ground truth as well, assuming the model is aligned with the HVS. Table 1 shows, however, that the compressed image is still recognizable by the machine within the top-5 accuracy, where we expected the machine to fail. This suggests that the machine has a different perception than the Human Visual System.<br />
<br />
<br />
<br/><br />
[[File:adacomp_sea_snake.jpg|500px|center]]<br />
<br/><br />
<div align="center">'''Figure 5:''' Sea Snake Image from ImageNet compressed with QF = 20 </div><br />
<br />
<br/><br />
[[File:adacomp table 3.PNG|500px|center]]<br />
<br/><br />
<div align="center">'''Table 1:''' Sea Snake Image prediction probability using the original image and the compressed one</div><br />
<br />
== Source Code ==<br />
<br />
https://github.com/AhmedHussKhalifa/AdaCompress<br />
<br />
== References ==<br />
<br />
[1] Hongshan Li, Yu Guo, Zhi Wang, Shutao Xia, and Wenwu Zhu, “Adacompress: Adaptive compression for online computer vision services,” in Proceedings of the 27th ACM International Conference on Multimedia, New York, NY, USA, 2019, MM ’19, pp. 2440–2448, ACM.<br />
<br />
[2] Zihao Liu, Tao Liu, Wujie Wen, Lei Jiang, Jie Xu, Yanzhi Wang, and Gang Quan, “DeepN-JPEG: A deep neural network favorable JPEG-based image compression<br />
framework,” in Proceedings of the 55th Annual Design Automation Conference. ACM, 2018, p. 18.<br />
<br />
[3] Lionel Gueguen, Alex Sergeev, Ben Kadlec, Rosanne Liu, and Jason Yosinski, “Faster neural networks straight from jpeg,” in Advances in Neural Information Processing Systems, 2018, pp. 3933–3944.<br />
<br />
[4] Kresimir Delac, Mislav Grgic, and Sonja Grgic, “Effects of jpeg and jpeg2000 compression on face recognition,” in Pattern Recognition and Image Analysis, Sameer Singh, Maneesha Singh, Chid Apte, and Petra Perner, Eds., Berlin, Heidelberg, 2005, pp. 136–145, Springer Berlin Heidelberg.<br />
<br />
[5] Robert Torfason, Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool, “Towards image understanding from deep compression without decoding,” 2018.<br />
<br />
[6] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy, “Mcdnn: An approximation-based execution framework for deep stream processing under resource constraints,” in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, New York, NY, USA, 2016, MobiSys ’16, pp. 123–136, ACM.<br />
<br />
[7] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, “Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation,” CoRR, vol. abs/1801.04381, 2018.<br />
<br />
[8] Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra, “Grad-cam: Why did you say that? visual explanations from deep networks via gradient-based localization,” CoRR, vol. abs/1610.02391, 2016.<br />
<br />
[9] George Toderici, Sean M. O'Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, Rahul Sukthankarm, Variable Rate Image Compression with Recurrent Neural Networks, ICLR 2016, arXiv:1511.06085<br />
<br />
[10] Johannes Ballé, Valero Laparra, Eero P. Simoncelli, End-to-end Optimized Image Compression, ICLR 2017, arXiv:1611.01704<br />
<br />
[11] Lucas Theis, Wenzhe Shi, Andrew Cunningham, Ferenc Huszár, Lossy Image Compression with Compressive Autoencoders, ICLR 2017, arXiv:1703.00395</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adacompress:_Adaptive_compression_for_online_computer_vision_services&diff=48809Adacompress: Adaptive compression for online computer vision services2020-12-02T03:52:24Z<p>Mrasooli: /* Introduction */</p>
<hr />
<div><br />
== Presented by == <br />
Ahmed Hussein Salamah<br />
<br />
== Introduction == <br />
<br />
The combination of big data and deep learning has driven the great success of artificial intelligence, but it also increases the burden on network bandwidth, computational complexity, and storage in many applications. In the recent literature, deep neural networks outperform other methods in image classification, one of the main tasks in the computer vision domain. Image classification models are increasingly deployed on the cloud to share computational power among different users, as mentioned in this paper (e.g., SenseTime, Baidu Vision, Google Vision, etc.). Most researchers in the literature work on improving the structure and increasing the depth of DNNs to achieve better performance, focusing on how features are represented and crafted by Convolutional Neural Networks (CNNs). Most well-known image classification datasets (e.g. ImageNet) are compressed with JPEG, a commonly used compression technique. JPEG is optimized for the Human Visual System (HVS), not for machines (i.e. DNNs). The authors therefore reconfigure JPEG for the machine side while maintaining the same classification accuracy. <br />
<br />
'''Why is image compression important?'''<br />
<br />
Image compression is crucial in deep learning because we want the image data to take up less disk space and load faster. Compared to lossless compression such as PNG, which preserves the original image data, JPEG is a lossy form of compression, meaning some information is lost in exchange for an improved compression ratio. It is therefore important to develop deep learning model-based image compression methods that reduce data size without jeopardizing classification accuracy. Examples of this type of image compression include the LSTM-based approach proposed by Google [9], the transformation-based method from New York University [10], and the autoencoder-based approach by Twitter [11].<br />
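As a minimal illustration of this size/quality trade-off (my own sketch, not code from the paper), the snippet below re-encodes an image at several JPEG quality factors with Pillow and reports the resulting file sizes; the file path is a placeholder.<br />
<pre>
from io import BytesIO
from PIL import Image

def jpeg_size_at_quality(img: Image.Image, quality: int) -> int:
    """Re-encode the image as JPEG at the given quality factor and return its size in bytes."""
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    return buf.getbuffer().nbytes

if __name__ == "__main__":
    image = Image.open("example.jpg")   # placeholder path
    for q in range(5, 100, 10):         # ten discrete levels, mirroring the agent's action space
        print(f"quality={q:2d}  size={jpeg_size_at_quality(image, q)} bytes")
</pre>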
<br />
== Methodology ==<br />
<br />
[[File: ada-fig2.PNG | 400px | center]]<br />
<div align="center">'''Figure 1:''' Comparing to the conventional solution, the authors [1] solution can update the compression strategy based on the backend model feedback </div><br />
<br />
One of the major parameters that can be changed in the JPEG pipeline is the quantization table, which is the main source of artifacts added to the image and what makes the compression lossy, as shown in [1, 4]. The authors are motivated to change the JPEG configuration to optimize the uploading rate for different cloud computer vision services without prior knowledge of the original model or dataset. This contrasts with the authors of [2, 3, 5], who adjust the JPEG configuration by retraining the parameters or according to the structure of the model. Lowering the quality level decreases the image size and quality, but the deep learning model can often still recognize the image, as shown in [4]. The authors in [1] used Deep Reinforcement Learning (DRL) in an online manner to choose the quantization level for uploading an image to the cloud computer vision model, and this is the only approach that designs an adaptive JPEG based on an ''RL mechanism''.<br />
<br />
The approach is designed around an interactive training environment that represents any cloud computer vision service. A deep Q neural network agent is used to evaluate and predict the performance of a quantization level on an uploaded image. The agent is trained with a reward function that considers two optimization objectives: accuracy and image size. Training proceeds iteratively through interaction with the environment. The environment is exposed to different images with different amounts of visually redundant information, which requires an adaptive solution that selects a suitable compression level for each image. Thus, the authors design an explore-exploit mechanism to train the agent on different sceneries, implemented in the deep Q agent as an inference-estimate-retrain mechanism that restarts the training procedure when needed. The authors support their approach with analysis and insight using Grad-CAM [8], showing patterns in how a compression level is chosen for each image with its corresponding quality factor. Each image elicits a different response from the deep learning model. In general, images are more sensitive to compression if they have large smooth areas, while those with complex textures are more robust to compression.<br />
<br />
'''What is a quantization table?'''<br />
<br />
Before getting to the quantization table, first look at the basic architecture of JPEG's baseline system. It has four blocks: FDCT (Forward Discrete Cosine Transform), quantizer, statistical model, and entropy encoder. The FDCT block takes an input image separated into <math> n \times n </math> blocks and applies a discrete cosine transformation, producing DCT coefficients. These coefficients take values from a relatively large discrete set and are then mapped, through the process of quantization, to a smaller discrete set. This is accomplished with a quantization table at the quantizer block, which is designed to preserve low-frequency information at the cost of high-frequency information. This preference is made because losing high-frequency information has less impact on the image as perceived by a human's visual system.<br />
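To make the role of the quantization table concrete, here is a small NumPy/SciPy sketch (my own illustration, not the authors' code) that applies the standard JPEG luminance quantization table, scaled by a quality factor with the commonly used IJG rule (an assumption, not taken from the paper), to a single 8×8 block of DCT coefficients; lower quality zeroes out more high-frequency coefficients.<br />
<pre>
import numpy as np
from scipy.fftpack import dct

# Standard JPEG luminance quantization table (Annex K of the JPEG standard).
Q50 = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99]], dtype=np.float64)

def scaled_table(quality: int) -> np.ndarray:
    """Scale the base table with the widely used IJG rule."""
    scale = 5000 / quality if quality < 50 else 200 - 2 * quality
    return np.clip(np.floor((Q50 * scale + 50) / 100), 1, 255)

def quantize_block(block: np.ndarray, quality: int) -> np.ndarray:
    """2-D DCT of one 8x8 block followed by quantization; high frequencies are coarsened most."""
    coeffs = dct(dct(block - 128.0, axis=0, norm="ortho"), axis=1, norm="ortho")
    return np.round(coeffs / scaled_table(quality))

block = np.random.randint(0, 256, (8, 8)).astype(np.float64)  # stand-in for an image block
print(np.count_nonzero(quantize_block(block, 95)), "non-zero coefficients at quality 95")
print(np.count_nonzero(quantize_block(block, 15)), "non-zero coefficients at quality 15")
</pre>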
<br />
== Problem Formulation ==<br />
<br />
The authors formulate the problem by writing the cloud deep learning service as <math> \vec{y}_i = M(x_i)</math>, which predicts a result list <math> \vec{y}_i </math> for an input image <math> x_i </math>; for a reference input <math> x_{\rm ref} \in X_{\rm ref} </math> the output is <math> \vec{y}_{\rm ref} = M(x_{\rm ref}) </math>. We refer to <math> \vec{y}_{\rm ref} </math> as the ground-truth label, and write <math> \vec{y}_c = M(x_c) </math> for a compressed image <math> x_{c} </math> with quality factor <math> c </math>.<br />
<br />
<br />
\begin{align} \tag{1} \label{eq:accuracy}<br />
\mathcal{A} =& \sum_{k}\min_jd(l_j, g_k) \\ <br />
& l_j \in \vec{y}_c, \quad j=1,...,5 \nonumber \\<br />
& g_k \in \vec{y}_{\rm ref}, \quad k=1, ..., {\rm length}(\vec{y}_{\rm ref}) \nonumber \\<br />
& d(x, y) = 1 \ \text{if} \ x=y \ \text{else} \ 0 \nonumber<br />
\end{align}<br />
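As a rough sketch of how this metric can be computed from the two returned label lists (this reads the indicator as a membership test of each reference label in the compressed top-5, which is my interpretation of Eq. \eqref{eq:accuracy}, not the authors' code; the label lists are hypothetical):<br />
<pre>
def top5_accuracy(ref_labels, top5_compressed):
    """Fraction of reference labels recovered in the compressed image's top-5 predictions."""
    hits = sum(1 for g in ref_labels if g in top5_compressed)
    return hits / max(len(ref_labels), 1)

# Hypothetical label lists returned by a cloud service for the reference and compressed images.
print(top5_accuracy(["tabby", "tiger cat"],
                    ["tabby", "lynx", "tiger cat", "Egyptian cat", "Persian cat"]))  # 1.0
</pre>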
<br />
The authors divide the datasets used into contextual groups <math> X </math> according to [6], and compare their results using the compression ratio <math> \Delta s = \frac{s_c}{s_{\rm ref}} </math>, where <math>s_{c}</math> is the compressed size and <math>s_{\rm ref}</math> is the original size, together with the accuracy metric <math> \mathcal{A}_c </math>, which is calculated from the Hamming distance between the Top-5 softmax outputs of the original and compressed images, as shown in Eq. \eqref{eq:accuracy}. In the RL design stage, continuous numerical vectors are used as the input features to the DRL agent, which is a Deep Q Network (DQN). The challenges of this approach are: <br />
(1) The state space of the RL problem is too large to cover, so the neural network typically needs more layers and nodes, which makes the DRL agent hard to converge and training time-consuming; <br />
(2) The DRL agent always starts from a random initial state, and it needs to find a high reward before the DQN training can make progress; the sparse reward feedback resulting from a random initialization makes learning difficult.<br />
The authors address this by using a small pre-trained model, MobileNetV2 [7], as a feature extractor <math> \mathcal{E} </math>, chosen for its light weight and image classification ability; it is kept fixed while training the Q network <math> \phi </math>. The output of the last convolution layer of <math> \mathcal{E} </math> is fed as input to the Q network <math>\phi </math>, so by optimizing the parameters of the Q network <math> \phi </math>, the RL agent's policy is updated.<br />
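A sketch of the described architecture (PyTorch, my own construction rather than the authors' code): a frozen MobileNetV2 backbone plays the role of the feature extractor <math> \mathcal{E} </math>, and a small trainable head plays the role of the Q network <math> \phi </math>, with one output per candidate quality level. The head sizes are illustrative assumptions.<br />
<pre>
import torch
import torch.nn as nn
from torchvision import models

class QAgent(nn.Module):
    """Frozen MobileNetV2 features feeding a small Q head with one output per quality level."""
    def __init__(self, n_actions: int = 10):
        super().__init__()
        backbone = models.mobilenet_v2(pretrained=True)
        self.features = backbone.features            # feature extractor E (kept frozen)
        for p in self.features.parameters():
            p.requires_grad = False
        self.q_head = nn.Sequential(                  # trainable Q network phi
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(1280, 256), nn.ReLU(),
            nn.Linear(256, n_actions))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            z = self.features(x)
        return self.q_head(z)                         # Q(s, c) for each quality level c

agent = QAgent()
q_values = agent(torch.randn(1, 3, 224, 224))         # dummy image tensor
best_action = q_values.argmax(dim=1)                   # index into the 10 discrete quality levels
</pre>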
<br />
==Reinforcement learning framework==<br />
<br />
This paper [1] describes the reinforcement learning problem with <math> \{\mathcal{X}, M\} </math> as the ''emulator environment'', where <math> \mathcal{X} </math> is the contextual information of the user input <math> x </math> and <math> M </math> is the backend cloud model. Each RL step is defined by an ''action and a state'': the action is one of 10 discrete quality levels ranging from 5 to 95 with a step size of 10, and the state is the feature extractor's output <math> \mathcal{E}(J(\mathcal{X}, c)) </math>, where <math> J(\cdot) </math> is the JPEG output at a specific quantization level <math> c </math>. The optimal quantization level at time <math> t </math> is <math> c_t = {\rm argmax}_cQ(\phi(\mathcal{E}(f_t)), c; \theta) </math>, where <math> Q(\phi(\mathcal{E}(f_t)), c; \theta) </math> is the action-value function and <math> \theta </math> denotes the parameters of the Q network <math> \phi </math>. In the training stage, the goal is to minimize the loss function <math> L_i(\theta_i) = \mathbb{E}_{s, c \sim \rho (\cdot)}\Big[\big(y_i - Q(s, c; \theta_i)\big)^2 \Big] </math> at each iteration <math> i </math>, where <math> s = \mathcal{E}(f_t) </math> and <math>f_t</math> is the output of the JPEG, and <math> y_i = \mathbb{E}_{s' \sim \{\mathcal{X}, M\}} \big[ r + \gamma \max_{c'} Q(s', c'; \theta_{i-1}) \mid s, c \big] </math> is the target, with <math> \rho(s, c) </math> a probability distribution over sequences <math> s </math> and quality levels <math> c </math> at iteration <math> i </math>, and <math> r </math> the feedback reward. <br />
<br />
The framework obtains a more accurate estimate from a selected action as the distance between the target and the action-value function's output <math> Q(\cdot)</math> is minimized. Since no feedback signal can tell that an episode has finished, a condition <math> t \geq T_{\rm start} </math> is used to guarantee that enough transitions are stored in the memory buffer <math> \mathcal{D} </math> before training begins. To create these transitions for the RL agent, random trials are collected to observe the environment's reaction. After fetching some trials from the environment with their corresponding rewards, this randomness is decreased as the agent is trained to minimize the loss function <math> L </math>, as shown in the algorithm below. The agent thus optimizes its actions on minibatches from <math> \mathcal{D} </math>, based on historically optimal experience, to train the compression level predictor <math> \phi </math>. When the trained predictor <math> \phi </math> is deployed, the RL agent drives the compression engine with the adaptive quality factor <math> c </math> corresponding to the input image <math> x_{i} </math>. <br />
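The following is a hedged sketch (my own, in PyTorch) of one DQN update built from the loss and target defined above; it assumes transitions of the form (state, action index, reward, next state) have already been stored in the buffer, and all hyperparameter values are placeholders.<br />
<pre>
import random
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, replay_buffer, batch_size=32, gamma=0.95):
    """One gradient step on L_i = E[(y_i - Q(s, c))^2] using a minibatch of stored transitions."""
    batch = random.sample(replay_buffer, batch_size)
    states = torch.stack([t[0] for t in batch])
    actions = torch.tensor([t[1] for t in batch])
    rewards = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    next_states = torch.stack([t[3] for t in batch])

    with torch.no_grad():                      # y_i = r + gamma * max_c' Q(s', c'; theta_{i-1})
        targets = rewards + gamma * target_net(next_states).max(dim=1).values
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    loss = F.mse_loss(q_sa, targets)           # (y_i - Q(s, c; theta_i))^2 averaged over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</pre>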
<br />
The interaction between the agent and the environment <math> \{\mathcal{X}, M\} </math> is evaluated with a reward function, formulated so that selecting an appropriate quality factor <math> c </math> is rewarded in direct proportion to the accuracy metric <math> \mathcal{A}_c </math> and in inverse proportion to the compression rate <math> \Delta s = \frac{s_c}{s_{\rm ref}} </math>. The reward function is given by <math> R(\Delta s, \mathcal{A}) = \alpha \mathcal{A} - \Delta s + \beta</math>, where <math> \alpha </math> and <math> \beta </math> are the coefficients of the linear combination.<br />
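The reward itself is simple to compute; a direct transcription of the formula above (the coefficient values here are placeholders, not the paper's settings):<br />
<pre>
def reward(delta_s: float, accuracy: float, alpha: float = 1.0, beta: float = 0.0) -> float:
    """R(delta_s, A) = alpha * A - delta_s + beta."""
    return alpha * accuracy - delta_s + beta
</pre>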
<br />
[[File:Alg2.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Algroithim :''' Training RL agent <math> \phi </math> in environment <math> \{\mathcal{X}, M\} </math> </div><br />
<br />
== Inference-Estimate-Retrain Mechanism ==<br />
The system diagram of AdaCompress, in contrast to the existing modules, is shown below. When AdaCompress is deployed, the scenery context <math> \mathcal{X} </math> of the input images may change, in which case the RL agent’s compression selection strategy may cause the overall accuracy to decrease. To address this, the estimator is invoked with probability <math>p_{\rm est} </math>: a random value <math> \xi \in (0,1) </math> is generated and the estimator is invoked if <math>\xi \leq p_{\rm est}</math>. AdaCompress then uploads both the original image and the compressed image to fetch their labels; the accuracy is calculated and the transition, which now also includes this accuracy, is stored in the memory buffer. Comparing the average accuracy of the most recent n steps with the earliest average accuracy, the estimator invokes the RL training kernel to retrain if the recent average accuracy is much lower than the initial average accuracy.<br />
<br />
[[File: diagfig.png|500px|center]]<br />
<br />
The authors address the change of scenery at the inference phase, which might otherwise cause learning to diverge, by introducing the '''inference-estimate-retrain mechanism'''. They introduce an estimator with probability <math> p_{\rm est} </math> that changes adaptively and is compared against a generated random value <math> \xi \in (0,1) </math>. As shown in Figure 2, AdaCompress switches between three states adaptively, as described in the following sections.<br />
<br />
[[File:fig3.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Figure 2:''' State Switching Policy </div><br />
<br />
=== Inference State ===<br />
The '''inference state''' runs most of the time; in it, the deployed RL agent predicts the compression level <math> c </math> to upload to the cloud with minimum upload traffic. The agent occasionally switches to the estimator state with probability <math> p_{\rm est} </math>, so it remains robust to changes in the scenery and maintains stable accuracy. <math> p_{\rm est} </math> is fixed during the inference stage but changes adaptively as a function of the accuracy gradient in the next stage. In the '''estimator state''', there is a trade-off between the objective of reducing upload traffic and the risk of a scenery change, so an accuracy-aware dynamic <math> p'_{\rm est} </math> is designed based on the average accuracy <math> \bar{\mathcal{A}}_n </math> over the last <math> n </math> of <math> N </math> steps, according to Eq. \eqref{eqn:accuracy_n}.<br />
\begin{align} \tag{2} \label{eqn:accuracy_n}<br />
\bar{\mathcal{A}_n} &=<br />
\begin{cases}<br />
\frac{1}{n}\sum_{i=N-n}^{N} \mathcal{A}_i & \text{ if } N \geq n \\ <br />
\frac{1}{n}\sum_{i=1}^{n} \mathcal{A}_i & \text{ if } N < n <br />
\end{cases}<br />
\end{align}<br />
===Estimator State===<br />
The '''estimator state''' is executed when <math> \xi \leq p_{\rm est} </math> is satisfied. The upload traffic increases, because both the reference image <math> x_{\rm ref} </math> and the compressed image <math> x_{i} </math> are uploaded to the cloud to calculate <math> \mathcal{A}_i </math> from <math> \vec{y}_{\rm ref} </math> and <math> \vec{y}_i </math>. The result is stored in the memory buffer <math> \mathcal{D} </math> as a transition <math> (\phi_i, c_i, r_i, \mathcal{A}_i) </math> of trial <math>i</math>. The current policy is no longer suitable when the average accuracy <math> \bar{\mathcal{A}}_n </math> over the latest <math>n</math> steps is lower than the average <math> \mathcal{A}_0 </math> over the earliest <math>n</math> steps in the memory buffer <math> \mathcal{D} </math>. Consequently, <math> p_{\rm est} </math> should be increased so that the estimator state occurs more frequently; it should therefore be a function of the gradient of the average accuracy <math> \bar{\mathcal{A}}_n </math>, so that the memory buffer <math> \mathcal{D} </math> fills with enough transitions to retrain the agent when the average accuracy <math> \bar{\mathcal{A}}_n </math> is low. The authors formulate <math> p'_{\rm est} = p_{\rm est} + \omega \nabla \bar{\mathcal{A}} </math>, where <math> \omega </math> is a scaling factor. Starting from an initial probability <math> p_0 </math>, the estimation probability takes the general form <math>p_{\rm est} = p_0 + \omega \sum_{i=0}^{N} \nabla \bar{\mathcal{A}_i} </math>. <br />
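A small sketch of this accuracy-aware update (my own reading of the formula: the gradient term is taken as the drop of the recent average accuracy relative to the earlier average, so that <math> p_{\rm est} </math> grows when accuracy degrades; the scaling factor value is a placeholder):<br />
<pre>
def update_estimation_probability(p_est, recent_acc, earlier_acc, omega=0.5):
    """p'_est = p_est + omega * grad_A, with grad_A read as the recent accuracy drop."""
    grad_a = earlier_acc - recent_acc          # positive when accuracy is degrading
    return min(1.0, max(0.0, p_est + omega * grad_a))

p_est = update_estimation_probability(0.1, recent_acc=0.78, earlier_acc=0.90)  # -> 0.16
</pre>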
<br />
===Retrain State===<br />
In the '''retrain state''', the RL agent is trained on the transitions stored in the memory buffer <math> \mathcal{D} </math> to adapt to the change in the input scenery. The retrain stage finishes when the average reward <math> \bar{r}_n </math> over the most recent <math> n </math> steps is higher than a threshold <math> r_{th}</math> defined by the user. Afterwards, the old buffer <math> \mathcal{D}</math> is flushed and new transitions are saved in preparation for a possible future retraining stage. The authors support their compression choices for different cloud application environments by applying a visualization algorithm [8] to some images with their corresponding quality factors <math> c </math>. The visualization shows that the agent chooses a quantization level <math> c </math> based on the visual textures in different regions of the image. For instance, a relatively low quality factor is selected when the central region has rough texture, whereas a relatively higher quality is chosen when the image is dominated by smooth regions.<br />
<br />
<br />
<br />
==Insight of RL agent’s behavior==<br />
In the inference state, the RL agent predicts a proper compression level based on the features of the input image. In the next subsection, we will see that this compression level varies for different image sets and backend cloud services. Also, by taking a look at the attention maps for some of the images, we will figure out why the agent has chosen this compression level.<br />
===Compression level choice variation===<br />
In Figure 5, for Face++ and Amazon Rekognition, the agent’s choices are mostly around compression level = 15, but for Baidu Vision, the agent’s choices are distributed more evenly. Therefore, the backend strategy really affects the choice for the optimal compression level.<br />
<br />
[[File:comp-level1.PNG|500px|center|fig: running-retrain]]<br />
In Figure 6, we see how the agent's behaviour in selecting the optimal compression level changes across datasets. The two datasets, ImageNet and DNIM, present different contextual sceneries: the ImageNet images were randomly selected and are mostly taken in daytime, while the DNIM images are mostly taken at night. Figure 6 shows that for DNIM images the agent's choices are concentrated at relatively high compression levels, whereas for the ImageNet dataset the agent's choices are distributed more evenly. <br />
<br />
[[File:comp-level2.PNG|500px|center|fig: running-retrain]]<br />
<br />
<br />
<br />
<br />
<br />
<br />
== Results ==<br />
The authors report in Figure 3 the three different cloud services compared against the benchmark images. The proposed design reduces the upload size by more than half while roughly preserving the top-5 accuracy computed with <math> \mathcal{A} </math>, with an average accuracy loss of about 7%, demonstrating the efficiency of the design. Figure 4 illustrates the ''' inference-estimate-retrain ''' mechanism: the x-axis indicates steps, and a <math> \Delta </math> mark on the <math>x</math>-axis indicates a change in the scenery. In Figure 4, the estimation probability <math> p_{\rm est} </math> and the accuracy are inversely related: as the accuracy drops below its initial value, <math> p_{\rm est} </math> increases adaptively, since the accuracy metric <math> \mathcal{A}_c </math> of each action <math> c </math> lowers the average accuracy used in the next estimations. At the red vertical line, the scenery starts to change and the <math>Q</math> network starts to retrain to adapt the agent to the current scenery. During the retraining stage, the output result is always taken from the reference image's prediction label <math> \vec{y}_{\rm ref} </math>. <br />
They also plot the scaled uploading data size of the proposed algorithm and the overhead data size of the benchmark during the inference stage. During the retraining stage, <math> p_{\rm est} </math> is effectively 1, so the reference image is uploaded alongside every compressed image and the accuracy <math> \mathcal{A} </math> is also 1, since the reference prediction is used as the output; in this stage the uploaded data exceeds the conventional benchmark. After the average accuracy becomes stable and high again, the transmission is reduced by decreasing the <math> p_{\rm est} </math> value. In the inference stage, the uploaded size is roughly halved, as shown in Figures 3 and 4.<br />
[[File:upload overhead.png|500px|center|fig: running-retrain]]<br />
<div align="center">'''Figure 3:''' Difference in overhead of size during training and inference phase </div><br />
<br />
[[File:ada-fig9.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Figure 4:''' Different cloud services compared relative to average size and accuracy </div><br />
<br />
<br />
[[File:ada-fig10.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Figure 5:''' Scenery change response from AdaCompress Algorithm </div><br />
<br />
==Conclusion==<br />
<br />
Most prior research focuses on modifying the deep learning model itself rather than working with the approaches that are already deployed. The authors succeed in selecting the compression level for each uploaded image so as to decrease its size while maintaining the top-5 accuracy in a robust manner, even when the scenery changes. <br />
In my opinion, Eq. \eqref{eq:accuracy} is not well defined, as I found it does not really affect the reward function. Also, they did not use the whole training set from ImageNet, which raises the question of what the largest file size in the subset they considered was. In addition, if they had used the whole dataset, should we expect the same performance from the mechanism, or better? I believe it would be better in both accuracy and compression.<br />
<br />
== Critiques == <br />
<br />
The authors used a pre-trained model as a feature extractor to select a Quality Factor (QF) for JPEG. What seems to be missing is a report of the distribution of the selected QFs over their span, as it is important to understand which levels are expected to contribute most across the datasets used. The authors also did not run their approach on a complete database like ImageNet; they only included parts of two different datasets. They may have been limited by the datasets available for testing, such as the CIFARs, which are not really comparable in resolution, since real online computer vision services work with higher-resolution images. <br />
In the next section, I run one experiment using Inception-V3 to see whether it is possible to get better accuracy. I found that using the Inception model as the pre-trained model makes it possible to choose a lower QF; however, as is well known, mobile models are shallower than Inception models, which makes them less complex to run on edge devices. I think it is possible to achieve at least the same accuracy, or even better, if the mobile model is replaced with Inception, as shown in the following section.<br />
<br />
=== Extra Analysis ===<br />
In the following figure, I took a single image from ImageNet with the ground truth ''' Sea Snake ''' and encoded it with a QF of 20. I ran inference with the Inception V3 model benchmarked by TensorFlow. The Human Visual System (HVS) cannot recognize the compressed image, so we would expect the trained model to miss the ground truth as well, assuming the model is aligned with the HVS. Table 1 shows, however, that the compressed image is still recognizable by the machine within the top-5 accuracy, where we expected the machine to fail. This suggests that the machine has a different perception than the Human Visual System.<br />
<br />
<br />
<br/><br />
[[File:adacomp_sea_snake.jpg|500px|center]]<br />
<br/><br />
<div align="center">'''Figure 5:''' Sea Snake Image from ImageNet compressed with QF = 20 </div><br />
<br />
<br/><br />
[[File:adacomp table 3.PNG|500px|center]]<br />
<br/><br />
<div align="center">'''Table 1:''' Sea Snake Image prediction probability using the original image and the compressed one</div><br />
<br />
== Source Code ==<br />
<br />
https://github.com/AhmedHussKhalifa/AdaCompress<br />
<br />
== References ==<br />
<br />
[1] Hongshan Li, Yu Guo, Zhi Wang, Shutao Xia, and Wenwu Zhu, “Adacompress: Adaptive compression for online computer vision services,” in Proceedings of the 27th ACM International Conference on Multimedia, New York, NY, USA, 2019, MM ’19, pp. 2440–2448, ACM.<br />
<br />
[2] Zihao Liu, Tao Liu, Wujie Wen, Lei Jiang, Jie Xu, Yanzhi Wang, and Gang Quan, “DeepN-JPEG: A deep neural network favorable JPEG-based image compression<br />
framework,” in Proceedings of the 55th Annual Design Automation Conference. ACM, 2018, p. 18.<br />
<br />
[3] Lionel Gueguen, Alex Sergeev, Ben Kadlec, Rosanne Liu, and Jason Yosinski, “Faster neural networks straight from jpeg,” in Advances in Neural Information Processing Systems, 2018, pp. 3933–3944.<br />
<br />
[4] Kresimir Delac, Mislav Grgic, and Sonja Grgic, “Effects of jpeg and jpeg2000 compression on face recognition,” in Pattern Recognition and Image Analysis, Sameer Singh, Maneesha Singh, Chid Apte, and Petra Perner, Eds., Berlin, Heidelberg, 2005, pp. 136–145, Springer Berlin Heidelberg.<br />
<br />
[5] Robert Torfason, Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool, “Towards image understanding from deep compression without decoding,” 2018.<br />
<br />
[6] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy, “Mcdnn: An approximation-based execution framework for deep stream processing under resource constraints,” in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, New York, NY, USA, 2016, MobiSys ’16, pp. 123–136, ACM.<br />
<br />
[7] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, “Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation,” CoRR, vol. abs/1801.04381, 2018.<br />
<br />
[8] Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra, “Grad-cam: Why did you say that? visual explanations from deep networks via gradient-based localization,” CoRR, vol. abs/1610.02391, 2016.<br />
<br />
[9] George Toderici, Sean M. O'Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, Rahul Sukthankarm, Variable Rate Image Compression with Recurrent Neural Networks, ICLR 2016, arXiv:1511.06085<br />
<br />
[10] Johannes Ballé, Valero Laparra, Eero P. Simoncelli, End-to-end Optimized Image Compression, ICLR 2017, arXiv:1611.01704<br />
<br />
[11] Lucas Theis, Wenzhe Shi, Andrew Cunningham, Ferenc Huszár, Lossy Image Compression with Compressive Autoencoders, ICLR 2017, arXiv:1703.00395</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features&diff=48808Model Agnostic Learning of Semantic Features2020-12-02T03:46:31Z<p>Mrasooli: /* Conclusion */</p>
<hr />
<div>== Presented by ==<br />
Milad Sikaroudi<br />
<br />
== Introduction ==<br />
Transfer learning is a line of research in machine learning that focuses on storing knowledge gained in one domain (the source domain) to solve a similar problem in another domain (the target domain). In addition to regular transfer learning, one can use "transfer metric learning", in which a more robust and discriminative data representation is formed by utilizing similarity relationships between samples [1], [2]. However, both kinds of techniques work only insofar as the domain shift between source and target domains is negligible. Domain shift is defined as the deviation between the distributions of the source domain and the target domain, and a large shift can cause a DNN model to fail completely. Multi-domain learning (MDL) is the solution when the assumption that the source and target domains come from almost the same distribution does not hold. Two variants of MDL in the literature are easily confused, namely domain generalization and domain adaptation; in domain adaptation we have some access to the target domain data, while in domain generalization we do not. This paper introduces a technique for domain generalization based on two complementary losses that regularize the semantic structure of the feature space through an episodic training scheme originally inspired by model-agnostic meta-learning.<br />
<br />
== Previous Work ==<br />
<br />
Originating from model-agnostic meta-learning (MAML), episodic training has been widely leveraged for addressing domain generalization [3, 4, 5, 7, 8, 6, 9, 10, 11]. The MLDG method [4] closely follows MAML in back-propagating the gradients from an ordinary task loss on meta-test data, but it has its own limitation: using only the task objective might be sub-optimal, since it relies solely on class probabilities. Most works [3, 7] in the literature lack notable guidance from the semantics of the feature space, which contains crucial domain-independent ‘general knowledge’ useful for domain generalization. The authors claim that their method is orthogonal to previous works.<br />
<br />
<br />
=== Model Agnostic Meta Learning ===<br />
Also known as "learning to learn", Model-agnostic Meta Learning is a learning paradigm in which optimal initial weights are found incrementally (episodic training) by minimizing a loss function over some similar tasks (meta-train, meta-test sets). Imagine a 4-shot 2-class image classification task as below:<br />
[[File:p5.png|800px|center]]<br />
Each training task provides an updated set of initial weights for the next round of training. By considering all of these updates together with the meta-test set, the final weights are calculated using the algorithm below.<br />
[[File:algo1.PNG|500px|center]]<br />
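To make the episodic update concrete, here is a toy MAML sketch in PyTorch (a 1-D regression problem of my own, not the image classification example above): an inner gradient step adapts the weights to each task, and the outer step updates the shared initialization from the post-adaptation losses. Learning rates and task definitions are illustrative.<br />
<pre>
import torch

def make_task(slope):
    """Toy regression task y = slope * x; each slope plays the role of one training task."""
    x = torch.linspace(-1, 1, 16).unsqueeze(1)
    return x, slope * x

w = torch.zeros(1, 1, requires_grad=True)          # meta-parameters of a linear model y = x @ w
meta_opt = torch.optim.SGD([w], lr=0.1)
inner_lr = 0.05

for step in range(100):
    meta_opt.zero_grad()
    for slope in (0.5, 1.0, 2.0):                  # tasks playing the role of meta-train splits
        x_tr, y_tr = make_task(slope)
        x_te, y_te = make_task(slope)              # meta-test data for the same task
        inner_loss = ((x_tr @ w - y_tr) ** 2).mean()
        grad, = torch.autograd.grad(inner_loss, w, create_graph=True)
        w_adapted = w - inner_lr * grad            # inner (task-specific) update
        meta_loss = ((x_te @ w_adapted - y_te) ** 2).mean()
        meta_loss.backward()                       # gradient flows back to the initialization w
    meta_opt.step()                                # outer (meta) update of the initialization
</pre>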
<br />
== Method ==<br />
In domain generalization, we assume that there are some domain-invariant patterns in the inputs (e.g. semantic features). These features can be extracted to learn a predictor that performs well across seen and unseen domains. This paper assumes that there are inter-class relationships across domains. In total, the MASF is composed of a '''task loss''', '''global class alignment''' term and a '''local sample clustering''' term.<br />
<br />
=== Task loss ===<br />
Let <math> F_{\psi}: X \rightarrow Z</math> be the feature extractor, where <math> Z </math> is a feature space, and let <math> T_{\theta}: Z \rightarrow \mathbf {R}^{C}</math> be the task network, where <math> C </math> is the number of classes in <math> Y </math>.<br />
Assume that <math>\hat{y}= softmax(T_{\theta}(F_{\psi}(x))) </math>. The parameters <math> (\psi, \theta) </math> are optimized by minimizing a cross-entropy loss, namely <math> \mathbf{L}_{task} </math>, formulated as:<br />
<br />
<div style="text-align: center;"><br />
<math> l_{task}(y, \hat{y}) = - \sum_{c}1[y=c]\log(\hat{y}_{c})</math><br />
</div><br />
<br />
Although minimizing the task loss yields a decent predictor, nothing prevents the model from overfitting to the source domains and degrading on unseen test domains. This issue is addressed by the other loss terms.<br />
<br />
===Model-Agnostic Learning with Episodic Training===<br />
The key to their learning procedure is an episodic training scheme, originating from model-agnostic meta-learning, that exposes the model optimization to distribution mismatch. In line with the goal of domain generalization, the model is trained on a sequence of simulated episodes with domain shift. Specifically, at each iteration the available domains <math>D</math> are randomly split into sets of meta-train <math>D_{tr}</math> and meta-test <math>D_{te}</math> domains. The model is trained to perform well semantically on the held-out <math>D_{te}</math> after being optimized with one or more steps of gradient descent on the <math>D_{tr}</math> domains. Here, the feature extractor's and task network's parameters, <math>\psi</math> and <math>\theta</math>, are first updated from the task-specific supervised loss <math>L_{task}</math> (e.g. cross-entropy for classification) computed on meta-train, giving updated parameters <math>(\psi', \theta')</math> that are then evaluated on meta-test with the semantic losses below.<br />
<br />
=== Global class alignment ===<br />
In semantic space, we assume there are relationships between class concepts that are invariant to changes in the observation domain. Capturing and preserving such class relationships can help models generalize well to unseen data. To achieve this, a global layout is imposed on the extracted features such that their relative locations reflect their semantic similarity. Since <math> L_{task} </math> focuses only on the dominant hard label prediction, it disregards the inter-class alignment across domains. Hence, the symmetrized Kullback–Leibler (KL) divergence across domains, averaged over all <math> C </math> classes, is minimized:<br />
<div style="text-align: center;"> <br />
<math> l_{global}(D_{i}, D{j}; \psi^{'}, \theta^{'}) = 1/C \sum_{c=1}^{C} 1/2[D_{KL}(s_{c}^{(i)}||s_{c}^{(j)}) + D_{KL}(s_{c}^{(j)}||s_{c}^{(i)})], </math><br />
</div><br />
Here <math> s_{c}^{(i)} </math> denotes the soft label distribution for class <math>c</math> in domain <math>D_i</math>, obtained by averaging the softmax outputs over that domain's samples of class <math>c</math>. The authors state that symmetric divergences such as Jensen–Shannon (JS) showed no significant difference from the symmetrized KL.<br />
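A hedged PyTorch sketch of this alignment term (my own construction): the per-class soft distributions are built by averaging softened softmax outputs over each class's samples in a domain, which is my reading of the section; the temperature value and the assumption that every class appears in both minibatches are mine.<br />
<pre>
import torch
import torch.nn.functional as F

def soft_class_distributions(logits, labels, n_classes, tau=2.0):
    """One averaged soft label distribution s_c per class (assumes every class is present)."""
    probs = F.softmax(logits / tau, dim=1)
    return torch.stack([probs[labels == c].mean(dim=0) for c in range(n_classes)])

def global_alignment_loss(logits_i, labels_i, logits_j, labels_j, n_classes):
    """Symmetrized KL between per-class distributions of domains D_i and D_j, averaged over classes."""
    s_i = soft_class_distributions(logits_i, labels_i, n_classes)
    s_j = soft_class_distributions(logits_j, labels_j, n_classes)
    kl_ij = (s_i * (s_i.log() - s_j.log())).sum(dim=1)
    kl_ji = (s_j * (s_j.log() - s_i.log())).sum(dim=1)
    return 0.5 * (kl_ij + kl_ji).mean()
</pre>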
<br />
=== Local cluster sampling ===<br />
While <math> L_{global} </math> captures inter-class relationships, we also want semantic features to be close to each other locally. Explicit metric learning, i.e. contrastive or triplet losses, is used to ensure that the semantic features cluster locally according to class labels only, regardless of the domain. The contrastive loss takes two samples as input and pulls samples of the same class closer while pushing samples of different classes apart.<br />
[[File: contrastive.png | 400px]]<br />
<br />
Conversely, triplet loss takes three samples as input: one anchor, one positive, and one negative. Triplet loss tries to make relevant samples closer than irrelevant ones.<br />
<div style="text-align: center;"><br />
<math><br />
l_{triplet}^{a,p,n} = \sum_{i=1}^{b} \sum_{k=1}^{c-1} \sum_{\ell=1}^{c-1}\! [m\!+\!\|x_{i}\!- \!x_{k}\|_2^2 \!-\! \|x_{i}\!-\!x_{\ell}\|_2^2 ]_+,<br />
</math><br />
</div><br />
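A minimal sketch of this margin-based clustering loss on precomputed embeddings (my own illustration; the margin and embedding sizes are placeholders):<br />
<pre>
import torch

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on squared distances: pull the positive closer than the negative by at least the margin."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    return torch.clamp(margin + d_pos - d_neg, min=0.0).mean()

# Embeddings of anchors, same-class samples (possibly from another domain), and different-class samples.
a, p, n = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
print(triplet_loss(a, p, n))
</pre>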
<br />
== Model agnostic learning of semantic features ==<br />
These losses are used in an episodic training scheme showed in the below figure:<br />
[[File:algo2.PNG|600px|center]]<br />
<br />
The training architecture and three losses are also illustrated as below:<br />
<br />
[[File:Ashraf99.png|800px|center]]<br />
<br />
== Experiments ==<br />
The usefulness of the proposed method has been demonstrated using two common benchmark datasets for domain generalization, i.e. VLCS and PACS, alongside a real-world MRI medical imaging segmentation task. In all of their experiments, the AlexNet with ImageNet pre-trained weights has been utilized. <br />
<br />
=== VLCS ===<br />
VLCS [12] is an aggregation of images from four other datasets: PASCAL VOC2007 (V) [13], LabelMe (L) [14], Caltech (C) [15], and SUN09 (S) [16]. <br />
Evaluation uses leave-one-domain-out validation, with each domain randomly divided into 70% training and 30% test data.<br />
<br />
<gallery><br />
File:p6.PNG|VLCS dataset<br />
</gallery><br />
<br />
Notably, MASF outperforms MLDG [4] on this dataset, as shown in the table below, indicating that exploiting semantic properties provides superior performance compared to a purely highly-abstracted task loss on meta-test. "DeepAll" in the table is the baseline with no domain generalization: only the class labels are used, regardless of the domain each sample lies in. <br />
<br />
[[File:table1_masf.PNG|600px|center]]<br />
<br />
=== PACS ===<br />
The more challenging domain generalization benchmark with a significant domain shift is the PACS dataset [17]. This dataset contains art painting, cartoon, photo, sketch domains with objects from seven classes: dog, elephant, giraffe, guitar, house, horse, person.<br />
<gallery><br />
File:p7_masf.jpg|PACS dataset sample<br />
</gallery> <br />
<br />
As shown in the table below, MASF significantly outperforms the state-of-the-art JiGen [18], MLDG [4], and MetaReg [3]. In addition, the best improvement (6.20%) is achieved when the unseen domain is "sketch", which requires more general knowledge about semantic concepts since it differs significantly from the other domains.<br />
<br />
[[File:Figure2 image.png|600px|center]]<br />
<div align="center">t-SNE visualizations of extracted features.</div> <br />
<br />
<br />
[[File:table2_masf.PNG|600px|center]]<br />
<br />
=== Ablation study over PACS===<br />
The ablation study over the PACS dataset shows the effectiveness of each loss term. <br />
[[File:table3_masf.PNG|600px|center]]<br />
<br />
=== Deeper Architectures ===<br />
For stronger baseline results, the authors have performed additional experiments using advanced deep residual architectures like ResNet-18 and ResNet-50. The below table shows strong and consistent improvements of MASF over the DeepAll baseline in all PACS splits for both network architectures. This suggests that the proposed algorithm is also beneficial for domain generalization with deeper feature extractors.<br />
[[File:Paper18_PacResults.PNG|600px|center]]<br />
<br />
=== Multi-site Brain MRI image segmentation === <br />
<br />
The effectiveness of MASF has also been demonstrated on a segmentation task of MRI images gathered from four different clinical centers, denoted Set-A, Set-B, Set-C, and Set-D. The domain shift in this case arises from differences in hardware, acquisition protocols, and many other factors, which hinders translating learning-based methods to real clinical practice. The authors attempt to segment the brain images into four classes: background, grey matter, white matter, and cerebrospinal fluid. Tasks such as these have enormous impact in clinical diagnosis and treatment planning. For example, designing a similar network to segment healthy brain tissue from tumorous brain tissue could aid surgeons in brain tumour resection.<br />
<br />
<gallery><br />
File:p8_masf.PNG|MRI dataset<br />
</gallery> <br />
<br />
<br />
The results show the effectiveness of MASF in comparison with not using domain generalization.<br />
[[File:table5_masf.PNG|300px|center]]<br />
<br />
== Conclusion ==<br />
<br />
A new domain generalization technique was presented that takes advantage of global and local constraints for learning semantic feature spaces, and it outperforms the state of the art. The power and effectiveness of this method were demonstrated on two domain generalization benchmarks and a real clinical dataset (MRI image segmentation). The code is publicly available at [19]. As future work, it would be interesting to integrate the proposed loss functions with other methods, as they are orthogonal to each other, and to evaluate the benefit of doing so. Also, investigating the use of the current learning procedure in the context of generative models would be an interesting research direction.<br />
<br />
== Critiques ==<br />
<br />
The purpose of this paper is to guide learning in the semantic feature space by leveraging local similarity, arguing that this space contains essential domain-independent general knowledge for domain generalization. The contrastive loss and triplet loss are adopted to encourage clustering in this space. Robust, domain-agnostic semantic features can be learned by leveraging across-domain class similarity information, which is important information during learning. Without separating samples from different source domains on a domain-invariant feature space with class-specific cohesion, the learner would suffer from indistinct decision boundaries. A potential problem, which may become more visible on large datasets, is that these decision boundaries might still be sensitive to the unseen target domain.<br />
<br />
== References ==<br />
<br />
[1]: Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop. Vol. 2. 2015.<br />
<br />
[2]: Hoffer, Elad, and Nir Ailon. "Deep metric learning using triplet network." International Workshop on Similarity-Based Pattern Recognition. Springer, Cham, 2015.<br />
<br />
[3]: Balaji, Yogesh, Swami Sankaranarayanan, and Rama Chellappa. "Metareg: Towards domain generalization using meta-regularization." Advances in Neural Information Processing Systems. 2018.<br />
<br />
[4]: Li, Da, et al. "Learning to generalize: Meta-learning for domain generalization." arXiv preprint arXiv:1710.03463 (2017).<br />
<br />
[5]: Li, Da, et al. "Episodic training for domain generalization." Proceedings of the IEEE International Conference on Computer Vision. 2019.<br />
<br />
[6]: Li, Haoliang, et al. "Domain generalization with adversarial feature learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.<br />
<br />
[7]: Li, Yiying, et al. "Feature-critic networks for heterogeneous domain generalization." arXiv preprint arXiv:1901.11448 (2019).<br />
<br />
[8]: Ghifary, Muhammad, et al. "Domain generalization for object recognition with multi-task autoencoders." Proceedings of the IEEE international conference on computer vision. 2015.<br />
<br />
[9]: Li, Ya, et al. "Deep domain generalization via conditional invariant adversarial networks." Proceedings of the European Conference on Computer Vision (ECCV). 2018<br />
<br />
[10]: Motiian, Saeid, et al. "Unified deep supervised domain adaptation and generalization." Proceedings of the IEEE International Conference on Computer Vision. 2017.<br />
<br />
[11]: Muandet, Krikamol, David Balduzzi, and Bernhard Schölkopf. "Domain generalization via invariant feature representation." International Conference on Machine Learning. 2013.<br />
<br />
[12]: Fang, Chen, Ye Xu, and Daniel N. Rockmore. "Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias." Proceedings of the IEEE International Conference on Computer Vision. 2013.<br />
<br />
[13]: Everingham, Mark, et al. "The pascal visual object classes (voc) challenge." International journal of computer vision 88.2 (2010): 303-338.<br />
<br />
[14]: Russell, Bryan C., et al. "LabelMe: a database and web-based tool for image annotation." International journal of computer vision 77.1-3 (2008): 157-173.<br />
<br />
[15]: Fei-Fei, Li. "Learning generative visual models from few training examples." Workshop on Generative-Model Based Vision, IEEE Proc. CVPR, 2004. 2004.<br />
<br />
[16]: Chopra, Sumit, Raia Hadsell, and Yann LeCun. "Learning a similarity metric discriminatively, with application to face verification." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE, 2005.<br />
<br />
[17]: Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. "Deeper, broader and artier domain generalization". IEEE International Conference on Computer Vision (ICCV), 2017. <br />
<br />
[18]: Carlucci, Fabio M., et al. "Domain generalization by solving jigsaw puzzles." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.<br />
<br />
[19]: https://github.com/biomedia-mira/masf</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features&diff=48807Model Agnostic Learning of Semantic Features2020-12-02T03:44:45Z<p>Mrasooli: /* PACS */</p>
<hr />
<div>== Presented by ==<br />
Milad Sikaroudi<br />
<br />
== Introduction ==<br />
Transfer learning is a line of research in machine learning that focuses on storing knowledge gained in one domain (the source domain) to solve a similar problem in another domain (the target domain). In addition to regular transfer learning, one can use "transfer metric learning", in which a more robust and discriminative data representation is formed by utilizing similarity relationships between samples [1], [2]. However, both kinds of techniques work only insofar as the domain shift between source and target domains is negligible. Domain shift is defined as the deviation between the distributions of the source domain and the target domain, and a large shift can cause a DNN model to fail completely. Multi-domain learning (MDL) is the solution when the assumption that the source and target domains come from almost the same distribution does not hold. Two variants of MDL in the literature are easily confused, namely domain generalization and domain adaptation; in domain adaptation we have some access to the target domain data, while in domain generalization we do not. This paper introduces a technique for domain generalization based on two complementary losses that regularize the semantic structure of the feature space through an episodic training scheme originally inspired by model-agnostic meta-learning.<br />
<br />
== Previous Work ==<br />
<br />
Originating from model-agnostic meta-learning (MAML), episodic training has been widely leveraged for addressing domain generalization [3, 4, 5, 7, 8, 6, 9, 10, 11]. The MLDG method [4] closely follows MAML in back-propagating the gradients from an ordinary task loss on meta-test data, but it has its own limitation: using only the task objective might be sub-optimal, since it relies solely on class probabilities. Most works [3, 7] in the literature lack notable guidance from the semantics of the feature space, which contains crucial domain-independent ‘general knowledge’ useful for domain generalization. The authors claim that their method is orthogonal to previous works.<br />
<br />
<br />
=== Model Agnostic Meta Learning ===<br />
Also known as "learning to learn", Model-agnostic Meta Learning is a learning paradigm in which optimal initial weights are found incrementally (episodic training) by minimizing a loss function over some similar tasks (meta-train, meta-test sets). Imagine a 4-shot 2-class image classification task as below:<br />
[[File:p5.png|800px|center]]<br />
Each training task provides an updated set of initial weights for the next round of training. By considering all of these updates together with the meta-test set, the final weights are calculated using the algorithm below.<br />
[[File:algo1.PNG|500px|center]]<br />
<br />
== Method ==<br />
In domain generalization, we assume that there are some domain-invariant patterns in the inputs (e.g. semantic features). These features can be extracted to learn a predictor that performs well across seen and unseen domains. This paper assumes that there are inter-class relationships across domains. In total, the MASF is composed of a '''task loss''', '''global class alignment''' term and a '''local sample clustering''' term.<br />
<br />
=== Task loss ===<br />
Let <math> F_{\psi}: X \rightarrow Z</math> be the feature extractor, where <math> Z </math> is a feature space, and let <math> T_{\theta}: Z \rightarrow \mathbf {R}^{C}</math> be the task network, where <math> C </math> is the number of classes in <math> Y </math>.<br />
Assume that <math>\hat{y}= softmax(T_{\theta}(F_{\psi}(x))) </math>. The parameters <math> (\psi, \theta) </math> are optimized by minimizing a cross-entropy loss, namely <math> \mathbf{L}_{task} </math>, formulated as:<br />
<br />
<div style="text-align: center;"><br />
<math> l_{task}(y, \hat{y}) = - \sum_{c}1[y=c]\log(\hat{y}_{c})</math><br />
</div><br />
<br />
Although minimizing the task loss yields a decent predictor, nothing prevents the model from overfitting to the source domains and degrading on unseen test domains. This issue is addressed by the other loss terms.<br />
<br />
===Model-Agnostic Learning with Episodic Training===<br />
The key to their learning procedure is an episodic training scheme, originating from model-agnostic meta-learning, that exposes the model optimization to distribution mismatch. In line with the goal of domain generalization, the model is trained on a sequence of simulated episodes with domain shift. Specifically, at each iteration the available domains <math>D</math> are randomly split into sets of meta-train <math>D_{tr}</math> and meta-test <math>D_{te}</math> domains. The model is trained to perform well semantically on the held-out <math>D_{te}</math> after being optimized with one or more steps of gradient descent on the <math>D_{tr}</math> domains. Here, the feature extractor's and task network's parameters, <math>\psi</math> and <math>\theta</math>, are first updated from the task-specific supervised loss <math>L_{task}</math> (e.g. cross-entropy for classification) computed on meta-train, giving updated parameters <math>(\psi', \theta')</math> that are then evaluated on meta-test with the semantic losses below.<br />
<br />
=== Global class alignment ===<br />
In semantic space, we assume there are relationships between class concepts that are invariant to changes in the observation domain. Capturing and preserving such class relationships can help models generalize well to unseen data. To achieve this, a global layout is imposed on the extracted features such that their relative locations reflect their semantic similarity. Since <math> L_{task} </math> focuses only on the dominant hard label prediction, it disregards the inter-class alignment across domains. Hence, the symmetrized Kullback–Leibler (KL) divergence across domains, averaged over all <math> C </math> classes, is minimized:<br />
<div style="text-align: center;"> <br />
<math> l_{global}(D_{i}, D{j}; \psi^{'}, \theta^{'}) = 1/C \sum_{c=1}^{C} 1/2[D_{KL}(s_{c}^{(i)}||s_{c}^{(j)}) + D_{KL}(s_{c}^{(j)}||s_{c}^{(i)})], </math><br />
</div><br />
Here <math> s_{c}^{(i)} </math> denotes the soft label distribution for class <math>c</math> in domain <math>D_i</math>, obtained by averaging the softmax outputs over that domain's samples of class <math>c</math>. The authors state that symmetric divergences such as Jensen–Shannon (JS) showed no significant difference from the symmetrized KL.<br />
<br />
=== Local cluster sampling ===<br />
While <math> L_{global} </math> captures inter-class relationships, we also want semantic features to be close to each other locally. Explicit metric learning, i.e. contrastive or triplet losses, is used to ensure that the semantic features cluster locally according to class labels only, regardless of the domain. The contrastive loss takes two samples as input and pulls samples of the same class closer while pushing samples of different classes apart.<br />
[[File: contrastive.png | 400px]]<br />
<br />
Conversely, triplet loss takes three samples as input: one anchor, one positive, and one negative. Triplet loss tries to make relevant samples closer than irrelevant ones.<br />
<div style="text-align: center;"><br />
<math><br />
l_{triplet}^{a,p,n} = \sum_{i=1}^{b} \sum_{k=1}^{c-1} \sum_{\ell=1}^{c-1}\! [m\!+\!\|x_{i}\!- \!x_{k}\|_2^2 \!-\! \|x_{i}\!-\!x_{\ell}\|_2^2 ]_+,<br />
</math><br />
</div><br />
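The following is a minimal sketch of a batch triplet loss of this form, with explicit anchor/positive/negative tensors rather than the in-batch mining of the equation above; the margin and tensor shapes are illustrative assumptions.<br />
<pre><br />
import torch<br />
<br />
def triplet_loss(anchor, positive, negative, m=0.2):<br />
    # Hinge on squared Euclidean distances: [m + ||a - p||^2 - ||a - n||^2]_+<br />
    d_ap = (anchor - positive).pow(2).sum(dim=1)<br />
    d_an = (anchor - negative).pow(2).sum(dim=1)<br />
    return torch.relu(m + d_ap - d_an).mean()<br />
<br />
a, p, n = (torch.randn(16, 64) for _ in range(3))   # 16 triplets of 64-d semantic features<br />
loss = triplet_loss(a, p, n)<br />
</pre><br />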
<br />
== Model agnostic learning of semantic features ==<br />
These losses are used in an episodic training scheme, shown in the figure below:<br />
[[File:algo2.PNG|600px|center]]<br />
<br />
The training architecture and three losses are also illustrated as below:<br />
<br />
[[File:Ashraf99.png|800px|center]]<br />
<br />
== Experiments ==<br />
The usefulness of the proposed method is demonstrated on two common benchmark datasets for domain generalization, VLCS and PACS, alongside a real-world MRI medical imaging segmentation task. In all of their experiments, an AlexNet with ImageNet pre-trained weights is used. <br />
<br />
=== VLCS ===<br />
VLCS [12] is an aggregation of images from four other datasets: PASCAL VOC2007 (V) [13], LabelMe (L) [14], Caltech (C) [15], and SUN09 (S) [16].<br />
Evaluation follows leave-one-domain-out validation, with each domain randomly divided into 70% training and 30% test.<br />
<br />
<gallery><br />
File:p6.PNG|VLCS dataset<br />
</gallery><br />
<br />
Notably, as the table below shows, MASF outperforms MLDG [4] on this dataset, indicating that enforcing semantic properties provides an advantage over a purely abstract task loss on meta-test. "DeepAll" in the table denotes the baseline without domain generalization: the model is trained on the class labels only, regardless of the domain each sample lies in. <br />
<br />
[[File:table1_masf.PNG|600px|center]]<br />
<br />
=== PACS ===<br />
A more challenging domain generalization benchmark with a significant domain shift is the PACS dataset [17]. It contains the art painting, cartoon, photo, and sketch domains, with objects from seven classes: dog, elephant, giraffe, guitar, house, horse, and person.<br />
<gallery><br />
File:p7_masf.jpg|PACS dataset sample<br />
</gallery> <br />
<br />
As shown in the table below, MASF significantly outperforms the state-of-the-art JiGen [18], MLDG [4], and MetaReg [3]. In addition, the largest improvement (6.20%) is achieved when the unseen domain is "sketch", which differs significantly from the other domains and therefore requires more general knowledge about semantic concepts.<br />
<br />
[[File:Figure2 image.png|600px|center]]<br />
<div align="center">t-SNE visualizations of extracted features.</div> <br />
<br />
<br />
[[File:table2_masf.PNG|600px|center]]<br />
<br />
=== Ablation study over PACS===<br />
The ablation study over the PACS dataset shows the effectiveness of each loss term. <br />
[[File:table3_masf.PNG|600px|center]]<br />
<br />
=== Deeper Architectures ===<br />
For stronger baselines, the authors performed additional experiments using deeper residual architectures, ResNet-18 and ResNet-50. The table below shows strong and consistent improvements of MASF over the DeepAll baseline in all PACS splits for both network architectures, suggesting that the proposed algorithm is also beneficial for domain generalization with deeper feature extractors.<br />
[[File:Paper18_PacResults.PNG|600px|center]]<br />
<br />
=== Multi-site Brain MRI image segmentation === <br />
<br />
The effectiveness of MASF is also demonstrated on a segmentation task of MRI images gathered from four different clinical centers, denoted Set-A, Set-B, Set-C, and Set-D. The domain shift in this case arises from differences in hardware, acquisition protocols, and many other factors, hindering the translation of learning-based methods to real clinical practice. The authors segment the brain images into four classes: background, grey matter, white matter, and cerebrospinal fluid. Tasks such as these have enormous impact on clinical diagnosis and treatment; for example, a similar network that segments healthy brain tissue from tumorous tissue could aid surgeons in brain tumour resection.<br />
<br />
<gallery><br />
File:p8_masf.PNG|MRI dataset<br />
</gallery> <br />
<br />
<br />
The results show the effectiveness of MASF in comparison to not using domain generalization.<br />
[[File:table5_masf.PNG|300px|center]]<br />
<br />
== Conclusion ==<br />
<br />
A new domain generalization technique was presented that takes advantage of global and local constraints for learning semantic feature spaces and outperforms the state of the art. Its effectiveness was demonstrated on two domain generalization benchmarks and a real clinical dataset (MRI image segmentation). The code is freely available at [19]. As future work, it would be interesting to integrate the proposed loss functions with other methods, since they are orthogonal to each other, and evaluate the benefit of doing so. Investigating the use of the current learning procedure in the context of generative models would also be an interesting research direction.<br />
<br />
== Critiques ==<br />
<br />
The purpose of this paper is to guide learning in the semantic feature space by leveraging local similarity, adopting contrastive and triplet losses to encourage clustering. The authors argue that across-domain class similarity carries essential domain-independent general knowledge for domain generalization, so semantic features that are robust regardless of the domain can be learned by leveraging this information during training. Without separating samples in a domain-invariant feature space and enforcing class-specific cohesion, the learner would suffer from indistinct decision boundaries. A major problem that may be revealed with large datasets is that these decision boundaries might still be sensitive to the unseen target domain.<br />
<br />
== References ==<br />
<br />
[1]: Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop. Vol. 2. 2015.<br />
<br />
[2]: Hoffer, Elad, and Nir Ailon. "Deep metric learning using triplet network." International Workshop on Similarity-Based Pattern Recognition. Springer, Cham, 2015.<br />
<br />
[3]: Balaji, Yogesh, Swami Sankaranarayanan, and Rama Chellappa. "Metareg: Towards domain generalization using meta-regularization." Advances in Neural Information Processing Systems. 2018.<br />
<br />
[4]: Li, Da, et al. "Learning to generalize: Meta-learning for domain generalization." arXiv preprint arXiv:1710.03463 (2017).<br />
<br />
[5]: Li, Da, et al. "Episodic training for domain generalization." Proceedings of the IEEE International Conference on Computer Vision. 2019.<br />
<br />
[6]: Li, Haoliang, et al. "Domain generalization with adversarial feature learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.<br />
<br />
[7]: Li, Yiying, et al. "Feature-critic networks for heterogeneous domain generalization." arXiv preprint arXiv:1901.11448 (2019).<br />
<br />
[8]: Ghifary, Muhammad, et al. "Domain generalization for object recognition with multi-task autoencoders." Proceedings of the IEEE international conference on computer vision. 2015.<br />
<br />
[9]: Li, Ya, et al. "Deep domain generalization via conditional invariant adversarial networks." Proceedings of the European Conference on Computer Vision (ECCV). 2018<br />
<br />
[10]: Motiian, Saeid, et al. "Unified deep supervised domain adaptation and generalization." Proceedings of the IEEE International Conference on Computer Vision. 2017.<br />
<br />
[11]: Muandet, Krikamol, David Balduzzi, and Bernhard Schölkopf. "Domain generalization via invariant feature representation." International Conference on Machine Learning. 2013.<br />
<br />
[12]: Fang, Chen, Ye Xu, and Daniel N. Rockmore. "Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias." Proceedings of the IEEE International Conference on Computer Vision. 2013.<br />
<br />
[13]: Everingham, Mark, et al. "The pascal visual object classes (voc) challenge." International journal of computer vision 88.2 (2010): 303-338.<br />
<br />
[14]: Russell, Bryan C., et al. "LabelMe: a database and web-based tool for image annotation." International journal of computer vision 77.1-3 (2008): 157-173.<br />
<br />
[15]: Fei-Fei, Li. "Learning generative visual models from few training examples." Workshop on Generative-Model Based Vision, IEEE Proc. CVPR, 2004. 2004.<br />
<br />
[16]: Chopra, Sumit, Raia Hadsell, and Yann LeCun. "Learning a similarity metric discriminatively, with application to face verification." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE, 2005.<br />
<br />
[17]: Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. "Deeper, broader and artier domain generalization". IEEE International Conference on Computer Vision (ICCV), 2017. <br />
<br />
[18]: Carlucci, Fabio M., et al. "Domain generalization by solving jigsaw puzzles." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.<br />
<br />
[19]: https://github.com/biomedia-mira/masf</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Object_Detection_with_Co-Attention_and_Co-Excitation&diff=48806One-Shot Object Detection with Co-Attention and Co-Excitation2020-12-02T03:43:05Z<p>Mrasooli: /* Approach */</p>
<hr />
<div>== Presented By ==<br />
Gautam Bathla<br />
<br />
== Background ==<br />
<br />
Object Detection is a technique where the model gets an image as an input and outputs the class and location of all the objects present in the image.<br />
<br />
[[File:object_detection.png|250px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Object Detection on an image</div><br />
<br />
Figure 1 shows an example where the model identifies and locates all the instances of different objects present in the image successfully. It encloses each object within a bounding box and annotates each box with the class of the object present inside the box.<br />
<br />
State-of-the-art object detectors are trained on thousands of images for different classes before the model can accurately predict the class and spatial location for unseen images belonging to the classes the model has been trained on. When a model is trained with K labeled instances for each of N classes, then this setting is known as N-way K-shot classification. K = 0 for zero-shot learning, K = 1 for one-shot learning and K > 1 for few-shot learning.<br />
<br />
== Introduction ==<br />
<br />
This paper tackles the problem of one-shot object detection, where, given a query image ''p'', the model needs to find all instances of the query object in a target image. The target and query images do not need to be exactly the same and are allowed to have variations as long as they share some attributes so that they can belong to the same category. In this paper, the authors have made contributions to three technical areas. First is the use of non-local operations to generate better region proposals for the target image based on the query image. This operation can be thought of as a co-attention mechanism. The second contribution is proposing a Squeeze and Co-Excitation mechanism to identify and give more importance to relevant features to filter out relevant proposals and hence the instances in the target image. Third, the authors designed a margin-based ranking loss which is useful for predicting the similarity of region proposals with the given query image irrespective of whether the label of the class is seen or unseen during the training process.<br />
<br />
== Previous Work ==<br />
<br />
All state-of-the-art object detectors are variants of deep convolutional neural networks. There are two types of object detectors:<br />
<br />
1) Two-Stage Object Detectors: These detectors generate region proposals in the first stage and classify and refine the proposals in the second stage, e.g. Faster R-CNN [1].<br />
<br />
2) One-Stage Object Detectors: These detectors directly predict bounding boxes and their corresponding labels in a single stage, either from a fixed set of anchors or from keypoints as in CornerNet [2].<br />
<br />
The work done to tackle the problem of few-shot object detection is based on transfer learning [3], meta-learning [4], and metric-learning.<br />
<br />
1) Transfer Learning: Chen et al. [3] proposed a regularization technique to reduce overfitting when the model is trained on just a few instances for each class belonging to unseen classes.<br />
<br />
2) Meta-Learning: Kang et al. [4] trained a meta-model to re-weight the learned weights of an image extracted from the base model.<br />
<br />
3) Metric-Learning: These frameworks replace the conventional classifier layer with the metric-based classifier layer.<br />
<br />
== Approach ==<br />
<br />
Let <math> C </math> be the set of classes for this object detection task. Since the one-shot object detection task involves unseen classes at inference time, we divide the set of classes into two categories as follows:<br />
<br />
<div style="text-align: center;"><math> C = C_0 \bigcup C_1,</math></div><br />
<br />
where <math>C_0</math> represents the classes that the model is trained on and <math>C_1</math> represents the classes on which the inference is done.<br />
<br />
[[File:architecture_object_detection.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Architecture</div><br />
<br />
Figure 2 shows the architecture of the model proposed in this paper. The model architecture is based on Faster R-CNN [1], with ResNet-50 [5] as the backbone for extracting features from the images. The target image and the query image are first passed through the ResNet-50 module to extract features from the same convolutional layer. These features are then passed into the Non-local block, whose output consists of weighted features for each of the two images. The new weighted feature set for both images is passed through the Squeeze and Co-excitation block, which outputs re-weighted features that act as input to the Region Proposal Network (RPN) module. The R-CNN module also includes a new loss, designed by the authors, to rank proposals in order of their relevance.<br />
<br />
==== Non-Local Object Proposals ====<br />
<br />
The need for non-local object proposals arises because the RPN module used in Faster R-CNN [1] has access to bounding box information for each class in the training dataset, and the class sets used for training and inference in Faster R-CNN [1] are not mutually exclusive. In this problem, as defined above, we divide the classes into two parts: one part is used for training and the other during inference, so the classes in the two sets are mutually exclusive. If the conventional RPN module were used, it would not be able to generate good proposals during inference because it has never seen bounding-box annotations for those classes.<br />
<br />
To resolve this problem, a non-local operation is applied to both sets of features. This non-local operation is defined as:<br />
\begin{align}<br />
y_i = \frac{1}{C(z)} \sum_{\forall j}^{} f(x_i, z_j)g(z_j) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
where ''x'' is a vector on which this operation is applied, ''z'' is a vector which is taken as an input reference, ''i'' is the index of output position, ''j'' is the index that enumerates over all possible positions, ''C(z)'' is a normalization factor, <math>f(x_i, z_j)</math> is a pairwise function like Gaussian, Dot product, concatenation, etc., <math>g(z_j)</math> is a linear function of the form <math>W_z \times z_j</math>, and ''y'' is the output of this operation.<br />
<br />
Let the feature maps obtained from the ResNet-50 model be <math> \phi{(I)} \in R^{N \times W_I \times H_I} </math> for target image ''I'' and <math> \phi{(p)} \in R^{N \times W_p \times H_p} </math> for query image ''p''. Taking <math> \phi{(p)} </math> as the input reference, the non-local operation is applied to <math> \phi{(I)} </math> and results in a non-local block, <math> \psi{(I;p)} \in R^{N \times W_I \times H_I} </math> . Analogously, we can derive the non-local block <math> \psi{(p;I)} \in R^{N \times W_p \times H_p} </math> using <math> \phi{(I)} </math> as the input reference. <br />
<br />
We can express the extended feature maps as:<br />
<br />
\begin{align}<br />
{F(I) = \phi{(I)} \oplus \psi{(I;p)} \in R^{N \times W_I \times H_I}} \quad ; \quad {F(p) = \phi{(p)} \oplus \psi{(p;I)} \in R^{N \times W_p \times H_p}} \tag{2} \label{eq:o1}<br />
\end{align}<br />
<br />
where ''F(I)'' denotes the extended feature map for target image ''I'', ''F(p)'' denotes the extended feature map for query image ''p'' and <math>\oplus</math> denotes element-wise sum over the feature maps <math>\phi{}</math> and <math>\psi{}</math>.<br />
<br />
As can be seen above, the extended feature map for the target image ''I'' contains not only features from ''I'' but also a weighted sum of features from the target and query images; the same holds for the query image. This weighted sum acts as a co-attention mechanism, and with the help of the extended feature maps, better proposals are generated when they are input to the RPN module.<br />
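A minimal sketch of the dot-product non-local operation and the extended feature maps of Eq. (2) is given below, with feature maps flattened to (positions x channels); the channel count, spatial sizes, and random weights are illustrative assumptions, not the paper's implementation details.<br />
<pre><br />
import torch<br />
<br />
def non_local_block(x, z, W_alpha, W_beta, W_z):<br />
    # x: (P_x, N) flattened target positions, z: (P_z, N) reference positions, W_*: (N, N).<br />
    f = (x @ W_alpha.T) @ (z @ W_beta.T).T   # (P_x, P_z) embedded dot-product similarities f(x_i, z_j)<br />
    g = z @ W_z.T                            # (P_z, N) linear embedding g(z_j)<br />
    return (f @ g) / z.shape[0]              # Eq. (1), with C(z) = number of positions in z<br />
<br />
N, P_I, P_p = 256, 100, 64                   # channels, target positions, query positions (assumptions)<br />
phi_I, phi_p = torch.randn(P_I, N), torch.randn(P_p, N)<br />
W_a, W_b, W_z = (torch.randn(N, N) for _ in range(3))<br />
psi_I = non_local_block(phi_I, phi_p, W_a, W_b, W_z)   # psi(I; p)<br />
psi_p = non_local_block(phi_p, phi_I, W_a, W_b, W_z)   # psi(p; I)<br />
F_I, F_p = phi_I + psi_I, phi_p + psi_p                # extended feature maps, Eq. (2)<br />
</pre><br />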
<br />
==== Squeeze and Co-Excitation ====<br />
<br />
The two feature maps generated from the non-local block above can be further related by identifying the important channels and therefore, re-weighting the weights of the channels. This is the basic purpose of this module. The Squeeze layer summarizes each feature map by applying Global Average Pooling (GAP) on the extended feature map for the query image. The Co-Excitation layer gives attention to feature channels that are important for evaluating the similarity metric. The whole block can be represented as:<br />
<br />
\begin{align}<br />
SCE(F(I), F(p)) = w \quad ; \quad F(\tilde{p}) = w \odot F(p) \quad ; \quad F(\tilde{I}) = w \odot F(I)\tag{3} \label{eq:op2}<br />
\end{align}<br />
<br />
where ''w'' is the excitation vector, <math>F(\tilde{p})</math> and <math>F(\tilde{I})</math> are the re-weighted features maps for query and target image respectively.<br />
<br />
In between the Squeeze layer and Co-Excitation layer, there exist two fully-connected layers followed by a sigmoid layer which helps to learn the excitation vector ''w''. The ''Channel Attention'' module in the architecture is basically these fully-connected layers followed by a sigmoid layer.<br />
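Below is a minimal sketch of the squeeze (GAP) and co-excitation (two fully-connected layers plus sigmoid) steps; the channel count, reduction ratio, and spatial sizes are illustrative assumptions.<br />
<pre><br />
import torch<br />
import torch.nn as nn<br />
<br />
N = 256                                         # number of channels (assumption)<br />
channel_attention = nn.Sequential(              # two fully-connected layers + sigmoid<br />
    nn.Linear(N, N // 16), nn.ReLU(),<br />
    nn.Linear(N // 16, N), nn.Sigmoid())<br />
<br />
F_I = torch.randn(1, N, 20, 30)                 # extended target feature map F(I)<br />
F_p = torch.randn(1, N, 8, 8)                   # extended query feature map F(p)<br />
squeezed = F_p.mean(dim=(2, 3))                 # squeeze: global average pooling -> (1, N)<br />
w = channel_attention(squeezed)                 # co-excitation vector w<br />
F_p_tilde = w[:, :, None, None] * F_p           # re-weighted query features<br />
F_I_tilde = w[:, :, None, None] * F_I           # re-weighted target features<br />
</pre><br />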
<br />
==== Margin-based Ranking Loss ====<br />
<br />
The authors have defined a two-layer MLP network ending with a softmax layer to learn a similarity metric which helps rank the proposals generated by the RPN module. In the first stage of training, each proposal is annotated with 0 or 1 based on the IoU value of the proposal with the ground-truth bounding box. If the IoU value is greater than 0.5, the proposal is labeled as 1 (foreground), and 0 (background) otherwise.<br />
<br />
Let ''q'' be the feature vector obtained after applying GAP to the query image patch obtained from the Squeeze and Co-Excitation block and ''r'' be the feature vector obtained after applying GAP to the region proposals generated by the RPN module. The two vectors are concatenated to form a new vector ''x'' which is the input to the two-layer MLP network designed. We can define ''x = [<math>r^T;q^T</math>]''. Let ''M'' be the model representing the two-layer MLP network, then <math>s_i = M(x_i)</math>, where <math>s_i</math> is the probability of <math>i^{th}</math> proposal being a foreground proposal based on the query image patch ''q''.<br />
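A small sketch of this metric module ''M'' is shown below; the feature dimension, hidden size, and the use of a two-way softmax whose second output is read as the foreground probability are assumptions made for illustration.<br />
<pre><br />
import torch<br />
import torch.nn as nn<br />
<br />
d = 256                                   # dimension of the pooled vectors r and q (assumption)<br />
M = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(),<br />
                  nn.Linear(d, 2), nn.Softmax(dim=1))<br />
<br />
r = torch.randn(128, d)                   # K = 128 pooled proposal features<br />
q = torch.randn(1, d).expand(128, d)      # pooled query feature, repeated per proposal<br />
x = torch.cat([r, q], dim=1)              # x = [r^T; q^T]<br />
s = M(x)[:, 1]                            # s_i: foreground probability of each proposal<br />
</pre><br />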
<br />
The margin-based ranking loss is given by:<br />
<br />
\begin{align}<br />
L_{MR}(\{x_i\}) = \sum_{i=1}^{K}y_i \times max\{m^+ - s_i, 0\} + (1-y_i) \times max\{s_i - m^-, 0\} + \delta_{i} \tag{4} \label{eq:op3}<br />
\end{align}<br />
\begin{align}<br />
\delta_{i} = \sum_{j=i+1}^{K}[y_i = y_j] \times max\{|s_i - s_j| - m^-, 0\} + [y_i \ne y_j] \times max\{m^+ - |s_i - s_j|, 0\} \tag{5} \label{eq:op4}<br />
\end{align}<br />
<br />
where ''[.]'' is the Iverson bracket, i.e. the output is 1 if the condition inside the bracket is true and 0 otherwise, <math>m^+</math> is the expected lower bound probability for predicting a foreground proposal, <math>m^-</math> is the expected upper bound probability for predicting a background proposal, and <math>K</math> is the number of candidate proposals from the RPN.<br />
<br />
The total loss for the model is given as:<br />
<br />
\begin{align}<br />
L = L_{CE} + L_{Reg} + \lambda \times L_{MR} \tag{6} \label{eq:op5}<br />
\end{align}<br />
<br />
where <math>L_{CE}</math> is the cross-entropy loss, <math>L_{Reg}</math> is the regression loss for bounding boxes of Faster R-CNN [1] and <math>L_{MR}</math> is the margin-based ranking loss defined above.<br />
<br />
For this paper, <math>m^+</math> = 0.7, <math>m^-</math> = 0.3, <math>\lambda</math> = 3, K = 128, C(z) in \eqref{eq:op} is the total number of elements in a single feature map of vector ''z'', and <math>f(x_i, z_j)</math> in \eqref{eq:op} is a dot product operation.<br />
\begin{align}<br />
f(x_i, z_j) = \alpha(x_i)^T \beta(z_j) \quad ; \quad \alpha(x_i) = W_{\alpha} x_i \quad ; \quad \beta(z_j) = W_{\beta} z_j \tag{7} \label{eq:op6}<br />
\end{align}<br />
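The following is a minimal sketch of the margin-based ranking loss of Eqs. (4)-(5), written as a plain double loop over proposals for clarity; the example scores and labels are placeholders.<br />
<pre><br />
import torch<br />
<br />
def margin_ranking_loss(s, y, m_pos=0.7, m_neg=0.3):<br />
    # s: (K,) predicted foreground probabilities, y: (K,) binary proposal labels.<br />
    loss = (y * torch.relu(m_pos - s) + (1 - y) * torch.relu(s - m_neg)).sum()   # Eq. (4)<br />
    K = s.shape[0]<br />
    for i in range(K):                      # pairwise term delta_i, Eq. (5)<br />
        for j in range(i + 1, K):<br />
            diff = (s[i] - s[j]).abs()<br />
            if y[i] == y[j]:<br />
                loss = loss + torch.relu(diff - m_neg)<br />
            else:<br />
                loss = loss + torch.relu(m_pos - diff)<br />
    return loss<br />
<br />
s = torch.rand(8)                           # placeholder scores for K = 8 proposals<br />
y = torch.randint(0, 2, (8,)).float()       # placeholder 0/1 labels<br />
loss_mr = margin_ranking_loss(s, y)<br />
</pre><br />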
<br />
== Results ==<br />
<br />
The model is trained and tested on two popular datasets, VOC and COCO. The ResNet-50 model was pre-trained on a reduced dataset by removing all the classes present in the COCO dataset, thus ensuring that the model has not seen any of the classes belonging to the inference images.<br />
<br />
==== Results on VOC Dataset ====<br />
<br />
[[File: voc_results_object_detection.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 1:''' Results on VOC dataset</div><br />
<br />
For the VOC dataset, the model is trained on the union of VOC 2007 train and validation sets and VOC 2012 train and validation sets, whereas the model is tested on VOC 2007 test set. From the VOC results (Table 1), it can be seen that the model with pre-trained ResNet-50 on a reduced training set as the CNN backbone (Ours(725)) achieves better performance on seen and unseen classes than the baseline models. When the pre-trained ResNet-50 on the full training set (Ours(1K)) is used as the CNN backbone, then the performance of the model is increased significantly.<br />
<br />
==== Results on MSCOCO Dataset ====<br />
<br />
[[File: mscoco_splits.png|750px|center|Image: 500 pixels]]<br />
[[File: mscoco_results_object_detection.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2:''' Results on COCO dataset</div><br />
<br />
The model is trained on the COCO train2017 set and evaluated on the COCO val2017 set. The classes are divided into four groups, and the model is trained with images belonging to three splits while the evaluation is done on images belonging to the fourth split. From Table 2, it is visible that the model achieves better accuracy than the baseline model. The bar chart in the split figure shows the performance of the model on each class separately. The model has some difficulty with classes such as book (split 2), handbag (split 3), and tie (split 4) because of variations in their shapes and textures.<br />
<br />
==== Overall Performance ====<br />
For VOC, the model that uses the reduced ImageNet model backbone with 725 classes achieves a better performance on both the seen and unseen classes. Remarkable improvements in the performance are seen with the backbone with 1000 classes. For COCO, the model achieves better accuracy than the Siamese Mask-RCNN model for both the seen and unseen classes.<br />
<br />
== Ablation Studies ==<br />
<br />
==== Effect of all the proposed techniques on the final result ====<br />
<br />
[[File: one_shot_detector_results.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 3:''' Effect of all three techniques combined</div><br />
<br />
Figure 3 shows the effect of the three proposed techniques on the evaluation metric. The model performs worst when neither the co-attention nor the co-excitation mechanism is used, but when either of them is used the performance improves significantly. The model performs best when all three proposed techniques are used.<br />
<br />
<br />
In order to understand the effect of the proposed modules, the authors analyzed each module separately.<br />
<br />
==== Visualizing the effect of Non-local RPN ====<br />
<br />
To demonstrate the effect of Non-local RPN, a heatmap of generated proposals is constructed. Each pixel is assigned the count of how many proposals cover that particular pixel and the counts are then normalized to generate a probability map.<br />
<br />
[[File: one_shot_non_local_rpn.png|250px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 4:''' Visualization of Non-local RPN</div><br />
<br />
From Figure 4, it can be seen that when a non-local RPN is used instead of a conventional RPN, the model is able to give more attention to the relevant region in the target image.<br />
<br />
==== Analyzing and Visualizing the effect of Co-Excitation ====<br />
<br />
To visualize the effect of excitation vector ''w'', the vector is calculated for all images in the inference set which are then averaged over images belonging to the same class, and a pair-wise Euclidean distance between classes is calculated.<br />
<br />
[[File: one_shot_excitation.png|250px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 5:''' Visualization of Co-Excitation</div><br />
<br />
From Figure 5, it can be observed that the Co-Excitation mechanism is able to assign meaningful weight distribution to each class. The weights for classes related to animals are closer to each other and the ''person'' class is not close to any other class because of the absence of common attributes between ''person'' and any other class in the dataset.<br />
<br />
[[File: analyzing_co_excitation_1.png|Analyzing Co-Excitation|500px|left|bottom|Image: 500 pixels]]<br />
<br />
[[File: analyzing_co_excitation_2.png|Analyzing Co-Excitation|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 6:''' Analyzing Co-Excitation</div><br />
<br />
To analyze the effect of Co-Excitation, the authors used two different scenarios. In the first scenario (Figure 6, left), the same target image is used for different query images. <math>p_1</math> and <math>p_2</math> query images have a similar color as the target image whereas <math>p_3</math> and <math>p_4</math> query images have a different color object as compared to the target image. When the pair-wise Euclidean distance between the excitation vector in the four cases was calculated, it can be seen that <math>w_2</math> was closer to <math>w_1</math> as compared to <math>w_4</math> and <math>w_3</math> was closer to <math>w_4</math> as compared to <math>w_1</math>. Therefore, it can be concluded that <math>w_1</math> and <math>w_2</math> give more importance to the texture of the object whereas <math>w_3</math> and <math>w_4</math> give more importance to channels representing the shape of the object.<br />
<br />
The same observation can be made in scenario 2 (Figure 6, right), where the same query image was used for different target images. <math>w_1</math> and <math>w_2</math> are closer to <math>w_a</math> than <math>w_b</math>, whereas <math>w_3</math> and <math>w_4</math> are closer to <math>w_b</math> than <math>w_a</math>. Since images <math>I_1</math> and <math>I_2</math> have objects of a similar color to the query image, we can say that <math>w_1</math> and <math>w_2</math> give more weight to the channels representing the texture of the object, and <math>w_3</math> and <math>w_4</math> give more weight to the channels representing shape.<br />
<br />
== Conclusion ==<br />
<br />
The resulting one-shot object detector outperforms all the baseline models on VOC and COCO datasets. The authors have also provided insights about how the non-local proposals, serving as a co-attention mechanism, can generate relevant region proposals in the target image and put emphasis on the important features shared by both target and query image.<br />
<br />
== Critiques ==<br />
<br />
The techniques proposed by the authors improve the performance of the model significantly: as shown above, when either co-attention or co-excitation is used together with the margin-based ranking loss, the model can detect instances of the query object in the target image. Also, the trained model is generic and does not require any retraining or fine-tuning to detect unseen classes in the target image. The designed loss also keeps the learning process from relying only on image labels, since it annotates each proposal as foreground or background and uses these annotations to compute the loss.<br />
One critique that comes to mind is how time-consuming the proposed model is, since it exploits several deep neural networks inside the main architecture. The paper could have elucidated more thoroughly whether the method is too computationally expensive.<br />
<br />
== Source Code==<br />
[https://github.com/timy90022/One-Shot-Object-Detection One-Shot-Object-Detection]<br />
<br />
== References ==<br />
<br />
[1] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99, 2015.<br />
<br />
[2] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, pages 765–781, 2018<br />
<br />
[3] Hao Chen, Yali Wang, Guoyou Wang, and Yu Qiao. LSTD: A low-shot transfer detector for object detection. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 2836–2843, 2018.<br />
<br />
[4] Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. CoRR, abs/1812.01866, 2018.<br />
<br />
[5] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization&diff=48765Meta-Learning For Domain Generalization2020-12-02T01:18:05Z<p>Mrasooli: /* Source Code */</p>
<hr />
<div>== Presented by ==<br />
Parsa Ashrafi Fashi<br />
<br />
== Introduction ==<br />
<br />
This paper proposes a novel meta-learning method for domain generalization. The domain shift problem refers to the situation where a model trained on one data distribution cannot perform well when tested on a domain with a different distribution. Domain generalization tries to tackle this problem by producing models that perform well on unseen target domains. Several approaches have been adopted for the problem, such as training a model for each source domain, extracting a domain-agnostic representation, and semantic feature learning. Meta-learning models, and specifically Model-Agnostic Meta-Learning (MAML) models, which have been widely adopted recently, are capable of adapting or generalizing to new tasks and new environments that have never been encountered during training. Meta-learning is also known as "learning to learn": it aims to enable intelligent agents to take the principles they learned in one domain and apply them to other domains. One concrete meta-learning task is to create a game bot that can quickly master a new game. By treating domains as tasks, the paper tries to overcome the domain shift problem in a model-agnostic way.<br />
<br />
== Previous Work ==<br />
There have been three common approaches to domain generalization. The simplest is to train a model for each source domain and estimate which model performs best on a new, unseen target domain [1]. A second approach presumes that any domain is composed of a domain-agnostic and a domain-specific component; by factoring these apart during training on the source domains, the domain-agnostic component can be extracted and transferred as a model that is likely to work on a new domain [2]. Finally, a domain-invariant feature representation can be learned to minimize the gap between multiple source domains, providing a domain-independent representation that performs well on a new target domain [3][4][5].<br />
<br />
== Method ==<br />
Let <math> S </math> and <math> T </math> be the source and target domains in the DG setting, respectively. We define a single model parametrized by <math> \theta </math> to solve the specified task. DG aims to train <math> \theta </math> on the source domains such that it generalizes to the target domains. At each learning iteration we split the <math> S </math> original source domains into <math> S-V </math> meta-train domains <math> \bar{S} </math> and <math> V </math> meta-test domains <math> \breve{S} </math> (virtual-test domains). This mimics real train-test domain shift so that, over many iterations, the model achieves good generalization in the final test evaluated on the target domains <math>T</math>. <br />
<br />
The paper explains the method based on two approaches; Supervised Learning and Reinforcement Learning.<br />
<br />
=== Supervised Learning ===<br />
<br />
First, <math> l(\hat{y},y) </math> is defined as the cross-entropy loss, <math> l(\hat{y},y) = -y\log(\hat{y}) </math>. The process is as follows.<br />
<br />
==== Meta-Train ====<br />
The model is updated on the <math> S-V </math> meta-train domains <math> \bar{S} </math>, and the loss function is defined as: <math> F(.) = \frac{1}{S-V} \sum\limits_{i=1}^{S-V} \frac {1}{N_i} \sum\limits_{j=1}^{N_i} l_{\theta}(\hat{y}_j^{(i)}, y_j^{(i)})</math><br />
<br />
In this step the model is optimized by one gradient descent step on the meta-train loss: <math> \theta^{\prime} = \theta - \alpha \nabla_{\theta} F(\theta) </math><br />
<br />
==== Meta-Test ====<br />
<br />
In each mini-batch the model is also virtually evaluated on the V meta-test domains <math>\breve{S}</math>. This meta-test evaluation simulates testing on new domains with different statistics, in order to allow learning to generalize across domains. The loss for the adapted parameters calculated on the meta-test domains is as follows: <math> G(.) = \frac{1}{V} \sum\limits_{i=1}^{V} \frac {1}{N_i} \sum\limits_{j=1}^{N_i} l_{\theta^{\prime}}(\hat{y}_j^{(i)}, y_j^{(i)})</math><br />
<br />
The loss on the meta-test domain is calculated using the updated parameters <math>\theta' </math> from meta-train. This means that for optimization with respect to <math>G </math> we will need the second derivative with respect to <math>\theta </math>. <br />
<br />
==== Final Objective Function ====<br />
<br />
Combining the two loss functions, the final objective function is: <math> argmin_{\theta} \; F(\theta) + \beta G(\theta - \alpha F^{\prime}(\theta)) </math>, where <math>\beta</math> controls the weight given to the meta-test loss. Algorithm 1 illustrates the supervised learning approach. <br />
<br />
[[File:ashraf1.jpg |center|600px]]<br />
<br />
<div align="center">Algorithm 1: MLDG Supervised Learning Approach.</div><br />
<br />
=== Reinforcement Learning ===<br />
<br />
In the reinforcement learning (RL) setting, we now assume an agent with a policy <math> \pi </math> that takes states <math> s </math> as input and produces actions <math> a </math> in a sequential decision-making task: <math>a_t = \pi_{\theta}(s_t)</math>. The agent operates in an environment and its goal is to maximize its discounted return, <math> R = \sum\limits_{t} \delta^t R_t(s_t, a_t) </math>, where <math> R_t </math> is the reward obtained at timestep <math> t </math> under policy <math> \pi </math> and <math> \delta </math> is the discount factor. Tasks in supervised learning map to reward functions here, and domains map to solving the same task (reward function) in different environments. Domain generalization therefore seeks an agent that performs well even in new environments without any additional learning.<br />
==== Meta-Train ==== <br />
In meta-training, the loss function <math> F(\cdot) </math> now corresponds to the negative discounted return <math> -R </math> of policy <math> \pi_{\theta} </math>, averaged over all the meta-training environments in <math> \bar{S} </math>. That is, <br />
\begin{align}<br />
F = \frac{1}{|\bar{S}|} \sum_{s \in \bar{S}} -R_s<br />
\end{align}<br />
<br />
Then the optimal policy is obtained by minimizing <math> F </math>.<br />
<br />
==== Meta-Test ====<br />
This step mirrors the meta-test of the supervised setting, and the loss is again the negative return. For RL, calculating this loss requires rolling out the meta-train-updated policy <math> \theta' </math> in the meta-test domains to collect new trajectories and rewards. The reinforcement learning approach is illustrated completely in Algorithm 2.<br />
[[File:ashraf2.jpg |center|600px]]<br />
<br />
<div align="center">Algorithm 1: MLDG Reinforcement Learning Approach.</div><br />
<br />
==== Alternative Variants of MLDG ====<br />
The authors propose several variants of the MLDG objective function. For example, the variant called MLDG-GC normalizes the meta-train and meta-test gradients and rewards their cosine similarity. It is given by:<br />
\begin{equation}<br />
\text{argmin}_\theta F(\theta) + \beta G(\theta) - \beta \alpha \frac{F'(\theta) \cdot G'(\theta)}{||F'(\theta)||_2 ||G'(\theta)||_2}.<br />
\end{equation}<br />
<br />
Another variant stops updating the parameters once the meta-train loss has converged. This intuition gives the following objective function, called MLDG-GN:<br />
\begin{equation}<br />
\text{argmin}_\theta F(\theta) - \beta ||G'(\theta) - \alpha F'(\theta)||_2^2<br />
\end{equation}<br />
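For intuition about the MLDG-GC variant above, the small NumPy sketch below evaluates its surrogate objective given precomputed, flattened gradients of F and G; the loss values, gradients, alpha, and beta are arbitrary illustrative numbers.<br />
<pre>
import numpy as np

def mldg_gc(F_val, G_val, grad_F, grad_G, alpha, beta):
    """F + beta*G - beta*alpha * cosine_similarity(F'(theta), G'(theta))."""
    cos = grad_F @ grad_G / (np.linalg.norm(grad_F) * np.linalg.norm(grad_G))
    return F_val + beta * G_val - beta * alpha * cos

print(mldg_gc(0.9, 1.2, np.array([0.3, -0.1]), np.array([0.2, 0.4]), alpha=0.1, beta=1.0))
</pre>
<br />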
<br />
== Experiments ==<br />
<br />
The proposed method is evaluated in four different experiments (two supervised learning and two reinforcement learning experiments). <br />
<br />
=== Illustrative Synthetic Experiment ===<br />
<br />
In this experiment, nine domains are synthesized by sampling curved deviations from a diagonal line classifier. Eight of these are treated as sources for meta-learning and the last one is held out for the final test. Fig. 1 shows the nine synthetic domains, which are related in form but differ in the details of their decision boundaries. The results show that MLDG performs nearly perfectly, while the baseline model, which ignores domains, overfits in the bottom-left corner. The methods compared in this experiment, as can be seen in Fig. 1, are MLP-All (the baseline), MLDG, MLDG-GC, and MLDG-GN.<br />
<br />
[[File:ashraf3.jpg |center|600px]]<br />
<br />
<div align="center">Figure 1: Synthetic experiment illustrating MLDG.</div><br />
<br />
=== Object Detection === <br />
The PACS multi-domain recognition benchmark, a dataset designed for cross-domain recognition problems, is used for this task. The dataset has 7 categories (‘dog’, ‘elephant’, ‘giraffe’, ‘guitar’, ‘house’, ‘horse’ and ‘person’) and 4 domains of different stylistic depictions (‘Photo’, ‘Art painting’, ‘Cartoon’ and ‘Sketch’). The diverse depiction styles provide a significant domain gap. The results of the proposed approach compared to other approaches are presented in Table 1. The baseline models are D-MTAE [5], Deep-All (vanilla AlexNet) [2], DSN [6] and AlexNet+TF [2]. On average, the proposed method outperforms the other methods. <br />
<br />
[[File:ashraf4.jpg |center|800px]]<br />
<br />
<div align="center">Table 1: Cross-domain recognition accuracy (Multi-class accuracy) on the PACS dataset. Best performance in bold. </div><br />
<br />
=== Cartpole ===<br />
<br />
The objective is to balance a pole upright by moving a cart. The action space is discrete: left or right. The state has four elements: the position and velocity of the cart and the angular position and velocity of the pole. Two sub-experiments are designed. In the first, the domain factor is varied by changing the pole length; nine domains with different pole lengths are simulated. In the second, multiple domain factors are varied: pole length and cart mass. In both experiments, six source domains are randomly chosen for training and three domains are held out for (true) testing. Since the game can last forever if the pole does not fall, the maximum number of steps is capped at 200. The results of both experiments are presented in Tables 2 and 3. The baseline methods are RL-All (trains a single policy by aggregating the reward from all six source domains), RL-Random-Source (trains on a single randomly selected source domain) and RL-Undo-Bias (an adaptation of the linear undo-bias model of [7]). The proposed MLDG outperforms the baselines.<br />
<br />
[[File:ashraf5.jpg |center|800px]]<br />
<br />
<div align="center">Table 2: Cart-Pole RL. Domain generalisation performance across pole length. Average reward testing on 3 held out domains with random lengths. Upper bound: 200. </div><br />
<br />
[[File:ashraf5.jpg |center|800px]]<br />
<br />
<div align="center">Table 3: Cart-Pole RL. Generalization performance across both pole length and cart mass. Return testing on 3 held out domains with random length and mass. Upper bound: 200. </div><br />
<br />
=== Mountain Car ===<br />
<br />
In this classic RL problem, a car is positioned between two mountains, and the agent needs to drive the car so that it can reach the peak of the right mountain. The difficulty is that the car engine is not strong enough to drive up the right mountain directly; the agent has to figure out that it must first drive up the left mountain to generate momentum before driving up the right one. The state observation consists of two elements: the position and velocity of the car. There are three available actions: drive left, do nothing, and drive right. The baselines are the same as for Cartpole. The model does not outperform RL-Undo-Bias but achieves a close return value. The results are shown in Table 4.<br />
<br />
[[File:ashraf7.jpg |center|800px]]<br />
<br />
<div align="center">Table 4: Domain generalisation performance for mountain car. Failure rate (↓) and reward (↑) on held-out testing domains with random mountain heights. </div><br />
<br />
== Conclusion ==<br />
<br />
This paper proposed a model-agnostic approach to domain generalization. Unlike prior model-based domain generalization approaches, it scales well with the number of domains and can be applied to different neural network architectures. Experimental evaluation shows state-of-the-art results on a recent challenging visual recognition benchmark and promising results on multiple classic RL problems.<br />
<br />
== Source Code ==<br />
<br />
Four different implementations of this paper are publicly available at [https://paperswithcode.com/paper/learning-to-generalize-meta-learning-for#code MLDG implementations].<br />
<br />
== Critiques ==<br />
<br />
I believe that the meta-learning-based approach (MLDG), which extends MAML to the domain generalization problem, might have some limitations. The MAML objective is geared towards fast task adaptation, as can be seen from the tasks presented in the original paper. Also, in domain generalization we do not have access to samples from the new domain, so a MAML-like objective might lead to sub-optimal solutions, as it is highly abstracted from the feature representations. In addition, it is hard to scale MLDG to deep architectures like ResNet, as it requires differentiating through k iterations of optimization updates; I would therefore expect it to be more effective on task networks, which are much shallower than feature networks.<br />
<br />
<br />
Why does meta-learning make domain generalization domain-agnostic? <br />
<br />
In the case where we have four domains, do we randomly pick two domains for meta-train and one for meta-test? If so, because we select two domains out of the three for meta-train, it is likely that similar meta-train domain splits recur between episodes, right?<br />
<br />
The paper would have benefited from demonstrating the strength of MLDG in terms of low-dimensional embedding spaces (t-SNE, UMAP) for PACS and other datasets. It is unclear how well the algorithm would have performed domain-agnostically on these datasets.<br />
<br />
== References ==<br />
<br />
[1]: [Xu et al. 2014] Xu, Z.; Li, W.; Niu, L.; and Xu, D. 2014. Exploiting low-rank structure from latent domains for domain generalization. In ECCV.<br />
<br />
[2]: [Li et al. 2017] Li, D.; Yang, Y.; Song, Y.-Z.; and Hospedales, T. 2017. Deeper, broader, and artier domain generalization. In ICCV.<br />
<br />
[3]: [Muandet, Balduzzi, and Schölkopf 2013] Muandet, K.; Balduzzi, D.; and Schölkopf, B. 2013. Domain generalization via invariant feature representation. In ICML.<br />
<br />
[4]: [Ganin and Lempitsky 2015] Ganin, Y., and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation. In ICML.<br />
<br />
[5]: [Ghifary et al. 2015] Ghifary, M.; Bastiaan Kleijn, W.; Zhang, M.; and Balduzzi, D. 2015. Domain generalization for object recognition with multi-task autoencoders. In ICCV.<br />
<br />
[6]: [Bousmalis et al. 2016] Bousmalis, K.; Trigeorgis, G.; Silberman, N.; Krishnan, D.; and Erhan, D. 2016. Domain separation networks. In NIPS.<br />
<br />
[7]: [Khosla et al. 2012] Khosla, A.; Zhou, T.; Malisiewicz, T.; Efros, A. A.; and Torralba, A. 2012. Undoing the damage of dataset bias. In ECCV.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=48533stat940F212020-11-30T20:56:44Z<p>Mrasooli: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Processing Method to Improve Robustness and Uncertainty || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || [https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration Summary] ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization Summary]|| [https://youtu.be/b9MU5cc3-m0 Presentation Video]<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A Fair Comparison of Graph Neural Networks for Graph Classification || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification Summary] || [https://drive.google.com/file/d/1Dx6mFL_zBAJcfPQdOWAuPn0_HkvTL_0z/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates Summary] || [[https://drive.google.com/file/d/1amkWrR8ZQKnnInjedRZ7jbXTqCA8Hy1r/view?usp=sharing Presentation ]]<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS Summary] || [https://drive.google.com/file/d/1mZVlF2UvJ2lGjuVcN5SYqBuO4jZjuCcU/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Object_Detection_with_Co-Attention_and_Co-Excitation Summary] || [https://drive.google.com/file/d/1OUx64_pTZzCQAdo_fmy_9h9NbuccTnn6/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE Summary] || [[https://youtu.be/5h-365TPQqE Presentation ]]<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations Summary] || Learn<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_Learning Summary] || [https://www.youtube.com/watch?v=WKUj30tgHfs&feature=youtu.be video]<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Generalization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features Summary]|| [https://youtu.be/djrJG6pJaL0 video] also available on Learn<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION Summary] || Learn<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_Classification Summary] || [https://www.youtube.com/watch?v=jG57QgY71yU video]<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Dense Passage Retrieval for Open-Domain Question Answering || [https://arxiv.org/abs/2004.04906 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering Summary] || Learn<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes Summary]|| Learn<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| AdaCompress: Adaptive Compression for Online Computer Vision Services || [https://arxiv.org/pdf/1909.08148.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adacompress:_Adaptive_compression_for_online_computer_vision_services Summary] || [https://youtu.be/D54qsSkqryk video] or Learn<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||RoBERTa: A Robustly Optimized BERT Pretraining Approach ||[https://openreview.net/forum?id=SyxS0T4tvS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta Summary] || [https://youtu.be/JdfvvYbH-2s Presentation Video]<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT Summary] || Learn<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||A critical analysis of self-supervision, or what we can learn from a single image|| [https://openreview.net/pdf?id=B1esx6EYvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION Summary]|| [https://youtu.be/HkkacHrvloE YouTube]<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| Self-Supervised Learning of Pretext-Invariant Representations || [https://openaccess.thecvf.com/content_CVPR_2020/papers/Misra_Self-Supervised_Learning_of_Pretext-Invariant_Representations_CVPR_2020_paper.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_Representations Summary] || [https://www.youtube.com/watch?v=IlIPHclzV5E&ab_channel=sinaebrahimifarsangi YouTube] or Learn<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval Summary]|| Learn<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=48530CRITICAL ANALYSIS OF SELF-SUPERVISION2020-11-30T20:50:35Z<p>Mrasooli: /* Results */</p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluates the performance of state-of-the-art self-supervised methods for learning the weights of convolutional neural networks (CNNs), on a per-layer basis. The authors are motivated by the fact that the low-level features in the first layers of a network may not require the high-level semantic information captured by manual labels. The paper also aims to figure out whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of a vast amount of unlabeled data to train CNNs and find a generalized image representation. <br />
In self-supervised learning, ground-truth labels are generated from the unlabeled data itself through pretext tasks such as the Jigsaw puzzle task [6] and rotation estimation [3]. For example, in the rotation task we have a picture of a bird without the label "bird". We rotate the bird image by, say, 90 degrees clockwise, and the CNN is trained to predict the rotation angle, as can be seen in the figure below.<br />
<br />
[[File:self-sup-rotation.png|700px|center]]<br />
<br />
[[File:intro.png|500px|center]]<br />
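As a concrete illustration of the rotation pretext task, the hypothetical PyTorch snippet below turns a batch of unlabeled images into a four-way rotation-classification problem; a CNN would then be trained with cross-entropy on these self-generated labels. The tensor shapes are arbitrary.<br />
<pre>
import torch

def rotation_pretext_batch(images):
    """RotNet-style pretext data: rotate each image by 0/90/180/270 degrees
    and label it with the rotation index (0-3)."""
    rotated, labels = [], []
    for k in range(4):  # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.shape[0],), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

x = torch.randn(8, 3, 32, 32)             # a batch of unlabeled images (N, C, H, W)
x_rot, y_rot = rotation_pretext_batch(x)  # 32 rotated images with generated labels
</pre>
<br />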
<br />
== Previous Work ==<br />
<br />
In recent literature, several papers addressed self-supervised learning methods and learning from a single sample.<br />
<br />
A BiGAN [2], or bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data, and the encoder acts as the inverse of the generator. After training a BiGAN, the encoder has learned to produce a rich image representation. In the RotNet method [3], images are rotated and the CNN learns to predict the rotation. DeepCluster [4] alternates between k-means clustering of the learned features and using the resulting cluster assignments as pseudo-labels to update the network, which yields feature representations that are stable under several image transformations.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, the authors compare results obtained with the one million images of the ImageNet dataset against a million augmented images generated from a single image. Various data augmentation methods, including cropping, rotation, scaling, contrast changes, and adding noise, are used to generate this artificial dataset from one image (a sketch of such a pipeline is given below). <br />
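The exact augmentation recipe is described in the paper; the snippet below is only an assumed, illustrative torchvision pipeline showing how a large synthetic dataset can be produced from one source image.<br />
<pre>
import torch
from torchvision import transforms

single_image_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),  # cropping / scaling
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),                  # small rotations
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
    transforms.Lambda(lambda t: t + 0.01 * torch.randn_like(t)),  # additive noise
])

# each call produces a new random view of the same source image
# views = [single_image_augment(source_image) for _ in range(1_000_000)]
</pre>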
<br />
To measure the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN and are trained entirely independently of the CNN itself [5]. Note that the main purpose of a CNN is to reach a linearly discriminable representation of images. Accordingly, the linear probing technique evaluates the training of each layer of a CNN and inspects how much information each layer has learned.<br />
The same experiment is also carried out on the CIFAR-10/100 datasets.<br />
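A linear probe can be sketched as follows (an assumed implementation, not the authors' code): the backbone is frozen, activations of the chosen convolutional layer are pooled, and only a linear layer is trained on top.<br />
<pre>
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, truncated_backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = truncated_backbone          # e.g. AlexNet cut after conv_k
        for p in self.backbone.parameters():
            p.requires_grad = False                 # the probe never updates the CNN
        self.pool = nn.AdaptiveAvgPool2d(2)         # reduce feature maps to 2x2
        self.fc = nn.Linear(feat_dim, num_classes)  # the only trainable part

    def forward(self, x):
        with torch.no_grad():
            h = self.pool(self.backbone(x)).flatten(1)
        return self.fc(h)
</pre>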
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how well the representations at each level are linearly separable. The results table reports the classification accuracy of the linear classifier trained on top of each convolutional layer.<br />
According to the results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only a single image with massive augmentation is utilized in this experiment.<br />
[[File:histo.png|500px|center]]<br />
[[File:table_results_imageNet_SSL_2.png|500px|center]]<br />
<br />
== Source Code ==<br />
<br />
The source code for the paper can be found here: https://github.com/yukimasano/linear-probes<br />
<br />
== Conclusion ==<br />
<br />
This paper reveals that if strong data augmentation is employed, as little as a single image is sufficient for self-supervision techniques to learn the first few layers of popular CNNs. However, even the presence of millions of images is not enough for learning the deeper layers, and supervision might still be necessary there. The results confirm that the weights of the first layers of deep networks contain limited information about natural images. Accordingly, current unsupervised learning of these early layers is largely driven by augmentation, and the capacity of a million images is probably not yet being used.<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019.<br />
<br />
[2] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149.<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,” arXiv preprint arXiv:1610.01644, 2016.<br />
<br />
[6] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=48520stat940F212020-11-30T20:31:05Z<p>Mrasooli: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || [https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration Summary] ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization Summary]|| [https://youtu.be/b9MU5cc3-m0 Presentation Video]<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A Fair Comparison of Graph Neural Networks for Graph Classification || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification Summary] || [https://drive.google.com/file/d/1Dx6mFL_zBAJcfPQdOWAuPn0_HkvTL_0z/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates Summary] || [[https://drive.google.com/file/d/1amkWrR8ZQKnnInjedRZ7jbXTqCA8Hy1r/view?usp=sharing Presentation ]]<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS Summary] || [https://drive.google.com/file/d/1mZVlF2UvJ2lGjuVcN5SYqBuO4jZjuCcU/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Object_Detection_with_Co-Attention_and_Co-Excitation Summary] || [https://drive.google.com/file/d/1OUx64_pTZzCQAdo_fmy_9h9NbuccTnn6/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE Summary] || [[https://youtu.be/5h-365TPQqE Presentation ]]<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations Summary] || Learn<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_Learning Summary] || [https://www.youtube.com/watch?v=WKUj30tgHfs&feature=youtu.be video]<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Generalization via Model-Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features Summary]|| [https://youtu.be/djrJG6pJaL0 video] also available on Learn<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION Summary] || Learn<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_Classification Summary] || [https://www.youtube.com/watch?v=jG57QgY71yU video]<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Dense Passage Retrieval for Open-Domain Question Answering || [https://arxiv.org/abs/2004.04906 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering Summary] || Learn<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes Summary]|| Learn<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| AdaCompress: Adaptive Compression for Online Computer Vision Services || [https://arxiv.org/pdf/1909.08148.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adacompress:_Adaptive_compression_for_online_computer_vision_services Summary] || [https://youtu.be/D54qsSkqryk video] or Learn<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||RoBERTa: A Robustly Optimized BERT Pretraining Approach ||[https://openreview.net/forum?id=SyxS0T4tvS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta Summary] || [https://youtu.be/JdfvvYbH-2s Presentation Video]<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT Summary] || Learn<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||A critical analysis of self-supervision, or what we can learn from a single image|| [https://openreview.net/pdf?id=B1esx6EYvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION Summary]|| Learn<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| Self-Supervised Learning of Pretext-Invariant Representations || [https://openaccess.thecvf.com/content_CVPR_2020/papers/Misra_Self-Supervised_Learning_of_Pretext-Invariant_Representations_CVPR_2020_paper.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_Representations Summary] || [https://www.youtube.com/watch?v=IlIPHclzV5E&ab_channel=sinaebrahimifarsangi YouTube] or Learn<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval Summary]|| Learn<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=48519stat940F212020-11-30T20:30:33Z<p>Mrasooli: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || [https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration Summary] ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization Summary]|| [https://youtu.be/b9MU5cc3-m0 Presentation Video]<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification Summary] || [https://drive.google.com/file/d/1Dx6mFL_zBAJcfPQdOWAuPn0_HkvTL_0z/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates Summary] || [[https://drive.google.com/file/d/1amkWrR8ZQKnnInjedRZ7jbXTqCA8Hy1r/view?usp=sharing Presentation ]]<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS Summary] || [https://drive.google.com/file/d/1mZVlF2UvJ2lGjuVcN5SYqBuO4jZjuCcU/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Object_Detection_with_Co-Attention_and_Co-Excitation Summary] || [https://drive.google.com/file/d/1OUx64_pTZzCQAdo_fmy_9h9NbuccTnn6/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE Summary] || [[https://youtu.be/5h-365TPQqE Presentation ]]<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations Summary] || Learn<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_Learning Summary] || [https://www.youtube.com/watch?v=WKUj30tgHfs&feature=youtu.be video]<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features Summary]|| [https://youtu.be/djrJG6pJaL0 video] also available on Learn<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION Summary] || Learn<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_Classification Summary] || [https://www.youtube.com/watch?v=jG57QgY71yU video]<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Dense Passage Retrieval for Open-Domain Question Answering || [https://arxiv.org/abs/2004.04906 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering Summary] || Learn<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes Summary]|| Learn<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| AdaCompress: Adaptive Compression for Online Computer Vision Services || [https://arxiv.org/pdf/1909.08148.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adacompress:_Adaptive_compression_for_online_computer_vision_services Summary] || [https://youtu.be/D54qsSkqryk video] or Learn<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||RoBERTa: A Robustly Optimized BERT Pretraining Approach ||[https://openreview.net/forum?id=SyxS0T4tvS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta Summary] || [https://youtu.be/JdfvvYbH-2s Presentation Video]<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT Summary] || Learn<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||A critical analysis of self-supervision, or what we can learn from a single image|| [https://openreview.net/pdf?id=B1esx6EYvr Paper] || Learn [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION Summary]||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| Self-Supervised Learning of Pretext-Invariant Representations || [https://openaccess.thecvf.com/content_CVPR_2020/papers/Misra_Self-Supervised_Learning_of_Pretext-Invariant_Representations_CVPR_2020_paper.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_Representations Summary] || [https://www.youtube.com/watch?v=IlIPHclzV5E&ab_channel=sinaebrahimifarsangi YouTube] or Learn<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval Summary]|| Learn<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || ||</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=48516When Does Self-Supervision Improve Few-Shot Learning?2020-11-30T20:25:34Z<p>Mrasooli: /* Introduction */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper proposes a technique utilizing self-supervised learning (SSL) to improve the generalization of few-shot learned representations on small labeled data sets. <br />
<br />
Few-shot learning refers to training a classifier on very small labeled datasets, in contrast to the usual practice of using massive amounts of data, in the hope of generalizing to previously unseen but related classes. <br />
<br />
Self-supervised learning aims to teach the agent the internal structure of images by giving it tasks such as predicting the degree of rotation applied to an image. The following image illustrates rotation prediction as a proxy task in self-supervision. The proposed method can help mitigate generalization issues where the agent cannot distinguish between newly introduced objects. Self-supervision is a powerful way to take advantage of the vast amount of available unlabeled data.<br />
<br />
[[File:rotation prediction 22.png|500px|center]]<br />
<br />
== Previous Work ==<br />
This work builds on few-shot learning, where the aim is to learn general representations so that, when facing novel classes, the agent can differentiate between them after training on just a few samples. Many few-shot learning methods exist; this paper focuses on Prototypical Networks, or ProtoNets [1] for short. A section of the paper also compares this model with the model-agnostic meta-learner (MAML) [2]. [note 1]<br />
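<br />
For readers unfamiliar with ProtoNets, the core classification rule is simple: each class prototype is the mean embedding of that class's support examples, and a query is assigned to the nearest prototype. A minimal PyTorch sketch is given below; the tensor shapes and function name are illustrative assumptions, not the authors' implementation.<br />
<pre>
import torch

def protonet_logits(support_emb, support_labels, query_emb, n_way):
    """support_emb: (n_way * k, D) support embeddings, support_labels: values in [0, n_way),
    query_emb: (Q, D). Returns (Q, n_way) logits = negative distances to class prototypes."""
    prototypes = torch.stack(
        [support_emb[support_labels == c].mean(dim=0) for c in range(n_way)])
    return -torch.cdist(query_emb, prototypes)  # closer prototype -> larger logit
</pre>
A cross-entropy loss on these logits against the query labels gives the episode loss used during meta-training.<br />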
<br />
<br />
The other machine learning technique this paper builds on is self-supervised learning. This technique uses unlabelled data, avoiding the expense of labeling and maintaining a massive data set. Images already contain structural information that can be exploited. Many SSL tasks exist, such as removing part of an image and having the agent reconstruct the missing part. Other pretext tasks include predicting rotations, relative patch locations, etc.<br />
<br />
The work in this paper is also related to multi-task learning, in which training proceeds on multiple tasks concurrently so that they improve each other. Training on multiple tasks is known to sometimes degrade the performance on individual tasks [3], and it seems to help only for very specific combinations and architectures. This paper shows that self-supervised tasks and few-shot learning are mutually beneficial, which has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning.<br />
<br />
In this framework, a feed-forward convolutional network <math>f(x)</math> maps either a labeled image or an augmented unlabelled image to an embedding space. Depending on the input type, the embedding is then mapped to one of two label spaces by either a classifier <math>g</math> or a function <math>h</math>. When evaluating the accuracy of the model, only the mappings of labelled images by the classifier <math>g</math> are considered, whereas during training the mappings of both labelled and unlabelled images by <math>g</math> and <math>h</math>, respectively, are utilized. <br />
The labelled training data consists of a set of base classes given as pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the unlabelled images used for the self-supervised tasks is denoted by <math>\mathcal{D}_{ss}</math>. Augmentations are applied to the images within this domain; the authors consider jigsaw-puzzle and rotation augmentations. They also compare the effect on accuracy of having the unlabelled image be an augmentation of the input labelled image (i.e., <math>\mathcal{D}_s = \mathcal{D}_{ss}</math>) versus an augmentation of a different image (i.e., <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>). <br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Combining supervised and self-supervised losses for few-shot learning. . This paper investigates how the performance on the supervised learning task is influenced by the the choice of the self-supervision task.</div><br />
<br />
The training procedure consists of mapping a labelled image and an unlabelled augmented image to separate embeddings using the shared feature backbone of the feed-forward convolutional network <math>f</math>. The model is then trained using a loss function <math>\mathcal{L}</math> that combines a classification loss term <math>\mathcal{L}_s</math> involving the labelled image embedding and a self-supervised loss term <math>\mathcal{L}_{ss}</math> involving the unlabelled augmented image embedding.<br />
<br />
The classification loss <math>\mathcal{L}_s</math> is defined as:<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
where it is common to use the cross-entropy loss for <math> \ell </math> and the <math> \ell_2 </math> norm for the regularizer <math> \mathcal{R} </math>.<br />
<br />
The task prediction loss <math>\mathcal{L}_{ss}</math> utilizes a separate function <math>h</math> which maps the embeddings of unlabelled images to a separate label space. Here the target label <math>\hat{y}</math> is determined by the augmentation applied to the unlabelled image: in the case of the jigsaw task the label is the index of the permutation applied to the original image, and in the case of rotation the label is the angle of rotation applied to the original image. If we define a set of labelled pairs for the previously unlabelled augmented images as <math> \forall x \in \mathcal{D}_{ss}, x \rightarrow (\hat{x}, \hat{y}) </math>, where <math>\hat{x}</math> is the identity mapping of <math>x</math>, then the task prediction loss can be defined as:<br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
<br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised loss acts as a data-dependent regularizer for representation learning. The gradient updates are performed based on this combined loss. It should be noted that for the case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on one batch from each dataset, and the two losses are combined.<br />
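<br />
The following is a minimal PyTorch-style sketch of one such update. The modules <math>f</math>, <math>g</math>, <math>h</math> and the batch variables are placeholders rather than the authors' code, and the regularizer <math>\mathcal{R}(f,g)</math> is assumed to be handled through the optimizer's weight decay.<br />
<pre>
import torch.nn.functional as F

def combined_step(f, g, h, optimizer, x_s, y_s, x_ss, y_ss):
    """One gradient step on L = L_s + L_ss.
    f: shared backbone, g: supervised classifier head, h: self-supervised head.
    (x_s, y_s): labeled batch; (x_ss, y_ss): augmented batch with pretext labels."""
    loss_s = F.cross_entropy(g(f(x_s)), y_s)     # supervised term L_s
    loss_ss = F.cross_entropy(h(f(x_ss)), y_ss)  # self-supervised term L_ss
    loss = loss_s + loss_ss                      # combined objective L
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</pre>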
<br />
== Experiments ==<br />
To assess the proposed method, several datasets, e.g., Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-ImageNet, have been employed. Each dataset is divided into three disjoint sets: a base set for training the parameters, a val set for validation, and a novel set for testing with a few examples per class, as shown in Figure 2. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
[[File:1.png |center|]]<br />
<br />
<div align="center">Figure 2: Used datasets and their base, validation and test splits.</div><br />
<br />
The authors used a meta-learning method based on prototypical networks, where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part, rotation and the jigsaw puzzle [4]. In the rotation task, the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math> to produce the input, and the target label is the index of the rotation in this list. In the jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and these tiles are shuffled to produce the input image. The target is the index of the applied permutation within a fixed set of 35 permutations chosen according to their Hamming distance from one another.<br />
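<br />
As an illustration of how the pretext inputs and targets described above can be constructed, the sketch below generates rotation labels for a batch and a shuffled jigsaw image in PyTorch. The permutation set is assumed to be given, square images are assumed for the rotation batch, and this is not the authors' code.<br />
<pre>
import torch

def make_rotation_batch(images):
    """images: (B, C, H, W) with H == W. Returns (4B, C, H, W) rotated images and labels in {0, 1, 2, 3}."""
    rotated = [torch.rot90(images, k, dims=(2, 3)) for k in range(4)]                # 0/90/180/270 degrees
    labels = [torch.full((images.size(0),), k, dtype=torch.long) for k in range(4)]  # rotation index
    return torch.cat(rotated), torch.cat(labels)

def make_jigsaw_input(image, permutation):
    """image: (C, H, W) with H, W divisible by 3; permutation: 9 tile indices.
    The target label is the index of `permutation` in the fixed set of 35 permutations."""
    C, H, W = image.shape
    th, tw = H // 3, W // 3
    tiles = [image[:, i*th:(i+1)*th, j*tw:(j+1)*tw] for i in range(3) for j in range(3)]
    shuffled = [tiles[p] for p in permutation]
    rows = [torch.cat(shuffled[r*3:(r+1)*3], dim=2) for r in range(3)]  # stitch each row of tiles
    return torch.cat(rows, dim=1)                                       # stack the three rows
</pre>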
<br />
== Results ==<br />
An N-way k-shot classification task contains N unique classes with k labeled images per class. The results on 5-way 5-shot classification accuracy can be seen in Fig. 3. ProtoNet has been used as a baseline and is compared with the jigsaw task, the rotation task, and both combined. The jigsaw task always improves the result, whereas the rotation task does not provide much improvement on the flowers and aircraft datasets. The authors speculate that this might be because flowers are mostly symmetric, making the rotation task too hard, and planes are usually horizontal, making it too simple.<br />
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
Another set of experiments shows that the improvements provided by self-supervised learning are larger on more difficult few-shot learning problems. As can be observed from Fig. 4, SSL is more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 5 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: Performance on few-shot learning using different meta-learners.</div><br />
<br />
Fig. 6 shows the effects of the size and domain of the SSL dataset on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 6(a) shows how accuracy changes as the percentage of the whole dataset used for SSL increases: increasing the size of the SSL dataset has a positive effect, with diminishing returns. Fig. 6(b) shows the effect of shifting the domain of the SSL dataset by replacing a percentage of its images with images from other datasets. This has a negative effect; moreover, training with SSL on only the 20 percent of images used for meta-learning is often better than increasing the size while shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
<br />
Figure 7 shows the accuracy of the meta-learner with SSL on different domains as a function of the distance between the supervised domain Ds and the self-supervised domain Dss. Once again we see that the effectiveness of SSL decreases with the distance from the supervised domain across all datasets.<br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 7: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels.<br />
== Source Code ==<br />
<br />
The source code can be found here: https://github.com/cvl-umass/fsl_ssl .<br />
== Conclusion ==<br />
The authors of this paper provide useful insight into the effects of using SSL as a regularizer for few-shot learning methods. They show that SSL is beneficial in almost every case; however, the improvements are larger on more difficult tasks. They also show that the dataset used for SSL does not necessarily have to be large: increasing its size can help, but only if the added images come from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the jigsaw puzzle and the rotation task, e.g., counting the number of objects or predicting a removed patch. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 8), we can see they have used mini-ImageNet with a picture size of <math>224\times224</math>, in contrast to other methods that have used an <math>84\times84</math> image size. This gives them a large advantage; however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
Moreover, in Fig. 8 the authors considered same-domain learning for different examples and indicated that adding more unlabeled data from the base classes increases accuracy. It would be interesting to apply their approach in a cross-domain setting, where the base and novel classes come from very different domains; this might add robustness and push accuracy further. Also, comparing cross-domain with same-domain learning might strengthen their point that the rotation task yields little improvement, especially on the flowers dataset, as flowers are mostly symmetric. <br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 8: Comparison with prior works on mini_ImageNet.</div><br />
<br />
I believe that both the strength and the weakness of this paper lie in its experiments. The experiments compare a variety of self-supervised learning algorithms, which is a good point. However, as the reviewers also pointed out, there are some concerns, including the level of novelty of the work, the way the unlabeled pool is created, and the use of a ResNet-101 pre-trained on ImageNet and mini-ImageNet in the experiments.<br />
<br />
== Notes ==<br />
:1. Model-Agnostic Meta-Learning (MAML): Neural networks perform very well on many tasks, but they often require large datasets. In contrast, humans are able to learn new skills from only a few examples. MAML is trained on a collection of tasks, which play the role of training sets, and is then used to learn new tasks that play the role of test sets. Therefore, MAML is able to perform well on tasks with small training sets without overfitting to the data. [5]<br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)<br />
<br />
[5]: Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:rotation_prediction_22.png&diff=48514File:rotation prediction 22.png2020-11-30T20:22:00Z<p>Mrasooli: </p>
<hr />
<div></div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:rotation_prediction.png&diff=48513File:rotation prediction.png2020-11-30T20:21:18Z<p>Mrasooli: </p>
<hr />
<div></div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=47189CRITICAL ANALYSIS OF SELF-SUPERVISION2020-11-28T05:32:17Z<p>Mrasooli: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluates the performance of state-of-the-art self-supervised methods for learning the weights of convolutional neural networks (CNNs) on a per-layer basis. It also aims to determine whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of the vast amount of unlabeled data for training CNNs and to find a generalized image representation. <br />
In self-supervised learning, unlabeled data generate their own ground-truth labels through pretext tasks such as rotation estimation. For example, given a picture of a cat without the label "cat", we rotate the image by 90 degrees clockwise and train the CNN to predict the rotation that was applied [3].<br />
<br />
[[File:intro.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
Several recent papers have addressed self-supervised learning methods and learning from a single sample.<br />
<br />
A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data, and the encoder acts as the inverse of the generator. After training a BiGAN, the encoder has learned to produce a rich image representation. In the RotNet method [3], images are rotated and the CNN learns to predict the rotation. DeepCluster [4] alternates between k-means clustering of the features and training on the resulting cluster assignments, learning feature representations that are stable under several image transformations.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, the authors compare results obtained with the million images of the ImageNet dataset against results obtained with a million augmented images generated from only one single image. Various data augmentation methods, including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate this artificial dataset from one image. <br />
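<br />
As a rough sketch of such a pipeline, a torchvision-based version might look as follows; the particular transforms and parameter values are assumptions for illustration and are not the exact settings used by the authors.<br />
<pre>
import torch
from torchvision import transforms

# Each call aug(source_image), with source_image a PIL image, yields a new sample,
# so a single source image can be expanded into an arbitrarily large augmented dataset.
aug = transforms.Compose([
    transforms.RandomRotation(15),                                   # small random rotations
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),             # cropping + rescaling
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
    transforms.Lambda(lambda t: (t + 0.01 * torch.randn_like(t)).clamp(0, 1)),  # mild additive noise
])
</pre>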
<br />
To measure the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN and are trained entirely independently of the CNN itself [5]. Note that the main purpose of a CNN is to reach a linearly discriminable representation of the images. Accordingly, the linear probing technique evaluates the training of each layer of a CNN and inspects how much information each layer has learned.<br />
The same experiment has been done using the CIFAR10/100 dataset.<br />
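<br />
A linear classifier probe of this kind can be sketched as follows; the choice of layer, the pooling, and the feature dimension are illustrative assumptions rather than the exact setup of the paper.<br />
<pre>
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Train only a linear classifier on frozen features taken from one layer of a CNN."""
    def __init__(self, frozen_features, feat_dim, num_classes):
        super().__init__()
        self.features = frozen_features               # e.g. AlexNet truncated after conv2
        for p in self.features.parameters():          # the backbone stays fixed
            p.requires_grad = False
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            z = self.features(x)
        z = z.mean(dim=(2, 3))                        # pool spatial dimensions to (B, C)
        return self.fc(z)
</pre>
Only the linear layer is updated during probe training, so the probe accuracy reflects how linearly separable the frozen layer's features are.<br />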
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how well representations at each level are linearly separable.<br />
According to the results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only a single image with massive augmentation is utilized in this experiment.<br />
<br />
[[File:histo.png|500px|center]]<br />
<br />
== Conclusion ==<br />
<br />
This paper reveals that if strong data augmentation is employed, as little as a single image is sufficient for self-supervision techniques to learn the first few layers of popular CNNs. However, even millions of images are not enough for learning the deeper layers, and supervision might still be necessary. The results confirm that the weights of the first layers of deep networks contain limited information about natural images. Accordingly, current unsupervised learning of these layers is mostly about augmentation, and we probably do not yet use the full capacity of a million images.<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019.<br />
<br />
[2] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149.<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,” arXiv preprint arXiv:1610.01644, 2016.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:intro.png&diff=47186File:intro.png2020-11-28T05:31:59Z<p>Mrasooli: </p>
<hr />
<div></div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:unsupervised.png&diff=47182File:unsupervised.png2020-11-28T05:30:50Z<p>Mrasooli: Mrasooli uploaded a new version of File:unsupervised.png</p>
<hr />
<div></div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Unsupervised.png&diff=47180File:Unsupervised.png2020-11-28T05:30:14Z<p>Mrasooli: Mrasooli uploaded a new version of File:Unsupervised.png</p>
<hr />
<div></div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=46982CRITICAL ANALYSIS OF SELF-SUPERVISION2020-11-27T04:30:25Z<p>Mrasooli: /* Method & Experiment */</p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluated the performance of state-of-the-art self-supervised methods on learning weights of convolutional neural networks (CNNs) and on a per-layer basis. Also, this paper aims to figure out whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of vast amount of unlabeled data for training CNNs and finding a generalized image representation. <br />
In self-supervised learning, unlabeled data generate ground truth labels per se by pretext tasks such as rotation estimation. For example, we have a picture of a cat without the label "cat". We rotate the cat image by 90 degrees clockwise and the CNN is trained in a way that to find the rotation axis [3].<br />
<br />
[[File:unsupervised.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In recent literature, several papers addressed self-supervised learning methods and learning from a single sample.<br />
<br />
A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data and the encoder performs as the opposite of the generator. After training BiGAN, the encoder has learned to generate a rich image representation. In RotNet method [3], images are rotated and the CNN learns to figure out the direction. DeepCluster [4] alternates k-means clustering to learn stable feature representations under several image transformations.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various methods of data augmentation including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image. <br />
<br />
With the intention of measuring the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN, and are trained entirely independently of the CNN itself [5]. Note that the main purpose of CNNs is to reach a linearly discriminable representation for images. Accordingly, linear probing technique aims to evaluate the training of each layer of a CNN and inspect how much information each of the layers learned.<br />
The same experiment has been done using the CIFAR10/100 dataset.<br />
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how well representations at each level are linearly separable.<br />
According to results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is utilized in this experiment.<br />
<br />
[[File:histo.png|500px|center]]<br />
<br />
== Conclusion ==<br />
<br />
This paper revealed that if a strong data-augmentation be employed, as little as a single image is sufficient for self-supervision techniques to learn the first few layers of popular CNNs. However, even the presence of millions of images are not enough for learning the deeper layers, and supervision might still be necessary. The results confirmed that the weights of the first layers of deep networks contain limited information about natural images. Accordingly, current unsupervised learning is only about augmentation, and we probably do not use the capacity of million images, yet.<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” inInternational Conference on Learning Representations, 2019<br />
<br />
[2] J. Donahue, P. Kr ̈ahenb ̈uhl, and T. Darrell, “Adversarial feature learning,”arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,”arXiv preprintarXiv:1803.07728, 2018<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” inProceedings ofthe European Conference on Computer Vision (ECCV), 2018, pp. 132–149<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,”arXiv preprint arXiv:1610.01644, 2016.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=46981CRITICAL ANALYSIS OF SELF-SUPERVISION2020-11-27T04:28:33Z<p>Mrasooli: /* Method & Experiment */</p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluated the performance of state-of-the-art self-supervised methods on learning weights of convolutional neural networks (CNNs) and on a per-layer basis. Also, this paper aims to figure out whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of vast amount of unlabeled data for training CNNs and finding a generalized image representation. <br />
In self-supervised learning, unlabeled data generate ground truth labels per se by pretext tasks such as rotation estimation. For example, we have a picture of a cat without the label "cat". We rotate the cat image by 90 degrees clockwise and the CNN is trained in a way that to find the rotation axis [3].<br />
<br />
[[File:unsupervised.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In recent literature, several papers addressed self-supervised learning methods and learning from a single sample.<br />
<br />
A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data and the encoder performs as the opposite of the generator. After training BiGAN, the encoder has learned to generate a rich image representation. In RotNet method [3], images are rotated and the CNN learns to figure out the direction. DeepCluster [4] alternates k-means clustering to learn stable feature representations under several image transformations.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various methods of data augmentation including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image. <br />
<br />
With the intention of measuring the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN [5]. Note that the main purpose of CNNs is to reach a linearly discriminable representation for images. Accordingly, linear probing technique aims to evaluate the training of a CNN and inspect how much information each of the layers learned.<br />
The same experiment has been done using the CIFAR10/100 dataset.<br />
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how well representations at each level are linearly separable.<br />
According to results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is utilized in this experiment.<br />
<br />
[[File:histo.png|500px|center]]<br />
<br />
== Conclusion ==<br />
<br />
This paper revealed that if a strong data-augmentation be employed, as little as a single image is sufficient for self-supervision techniques to learn the first few layers of popular CNNs. However, even the presence of millions of images are not enough for learning the deeper layers, and supervision might still be necessary. The results confirmed that the weights of the first layers of deep networks contain limited information about natural images. Accordingly, current unsupervised learning is only about augmentation, and we probably do not use the capacity of million images, yet.<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” inInternational Conference on Learning Representations, 2019<br />
<br />
[2] J. Donahue, P. Kr ̈ahenb ̈uhl, and T. Darrell, “Adversarial feature learning,”arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,”arXiv preprintarXiv:1803.07728, 2018<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” inProceedings ofthe European Conference on Computer Vision (ECCV), 2018, pp. 132–149<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,”arXiv preprint arXiv:1610.01644, 2016.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=46980CRITICAL ANALYSIS OF SELF-SUPERVISION2020-11-27T04:28:12Z<p>Mrasooli: /* Method & Experiment */</p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluated the performance of state-of-the-art self-supervised methods on learning weights of convolutional neural networks (CNNs) and on a per-layer basis. Also, this paper aims to figure out whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of vast amount of unlabeled data for training CNNs and finding a generalized image representation. <br />
In self-supervised learning, unlabeled data generate ground truth labels per se by pretext tasks such as rotation estimation. For example, we have a picture of a cat without the label "cat". We rotate the cat image by 90 degrees clockwise and the CNN is trained in a way that to find the rotation axis [3].<br />
<br />
[[File:unsupervised.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In recent literature, several papers addressed self-supervised learning methods and learning from a single sample.<br />
<br />
A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data and the encoder performs as the opposite of the generator. After training BiGAN, the encoder has learned to generate a rich image representation. In RotNet method [3], images are rotated and the CNN learns to figure out the direction. DeepCluster [4] alternates k-means clustering to learn stable feature representations under several image transformations.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various methods of data augmentation including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image. <br />
<br />
With the intention of measuring the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN. Note that the main purpose of CNNs is to reach a linearly discriminable representation for images. Accordingly, linear probing technique aims to evaluate the training of a CNN and inspect how much information each of the layers learned.<br />
The same experiment has been done using the CIFAR10/100 dataset.<br />
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how well representations at each level are linearly separable.<br />
According to results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is utilized in this experiment.<br />
<br />
[[File:histo.png|500px|center]]<br />
<br />
== Conclusion ==<br />
<br />
This paper revealed that if a strong data-augmentation be employed, as little as a single image is sufficient for self-supervision techniques to learn the first few layers of popular CNNs. However, even the presence of millions of images are not enough for learning the deeper layers, and supervision might still be necessary. The results confirmed that the weights of the first layers of deep networks contain limited information about natural images. Accordingly, current unsupervised learning is only about augmentation, and we probably do not use the capacity of million images, yet.<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” inInternational Conference on Learning Representations, 2019<br />
<br />
[2] J. Donahue, P. Kr ̈ahenb ̈uhl, and T. Darrell, “Adversarial feature learning,”arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,”arXiv preprintarXiv:1803.07728, 2018<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” inProceedings ofthe European Conference on Computer Vision (ECCV), 2018, pp. 132–149<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,”arXiv preprint arXiv:1610.01644, 2016.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=46979CRITICAL ANALYSIS OF SELF-SUPERVISION2020-11-27T04:26:52Z<p>Mrasooli: /* Method & Experiment */</p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluated the performance of state-of-the-art self-supervised methods on learning weights of convolutional neural networks (CNNs) and on a per-layer basis. Also, this paper aims to figure out whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of vast amount of unlabeled data for training CNNs and finding a generalized image representation. <br />
In self-supervised learning, unlabeled data generate ground truth labels per se by pretext tasks such as rotation estimation. For example, we have a picture of a cat without the label "cat". We rotate the cat image by 90 degrees clockwise and the CNN is trained in a way that to find the rotation axis [3].<br />
<br />
[[File:unsupervised.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In recent literature, several papers addressed self-supervised learning methods and learning from a single sample.<br />
<br />
A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data and the encoder performs as the opposite of the generator. After training BiGAN, the encoder has learned to generate a rich image representation. In RotNet method [3], images are rotated and the CNN learns to figure out the direction. DeepCluster [4] alternates k-means clustering to learn stable feature representations under several image transformations.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various methods of data augmentation including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image. <br />
<br />
To measure the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes [5] are commonly used to monitor the features at every layer of a model. Since the main purpose of a CNN classifier is to reach a linearly separable representation of images, the linear probing technique evaluates the training of a CNN by inspecting how much linearly decodable information each layer has learned.<br />
The same experiment is repeated using the CIFAR-10/100 datasets.<br />
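The linear-probe evaluation can be sketched as follows: freeze the (self-supervised) AlexNet backbone, extract features at a chosen convolutional layer, and fit a linear classifier on top. This is a minimal sketch, not the paper's code; the probed layer index, pooling size, and optimizer settings are assumptions.<br />
<pre>
import torch
import torch.nn as nn
from torchvision.models import alexnet

backbone = alexnet(weights=None).features      # self-supervised weights would be loaded here
backbone.eval()
for p in backbone.parameters():                # freeze the backbone
    p.requires_grad = False

probe_layer = 5                                # index of the conv block to probe (assumption)
num_classes = 1000                             # ImageNet classes

def extract(x):
    """Run the frozen backbone up to the probed layer and pool the features."""
    with torch.no_grad():
        for i, layer in enumerate(backbone):
            x = layer(x)
            if i == probe_layer:
                break
    return torch.flatten(nn.functional.adaptive_avg_pool2d(x, 6), 1)

feat_dim = extract(torch.zeros(1, 3, 224, 224)).shape[1]
probe = nn.Linear(feat_dim, num_classes)       # the linear classifier probe
optimizer = torch.optim.SGD(probe.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def probe_step(images, labels):
    """One supervised training step of the probe on frozen features."""
    logits = probe(extract(images))
    loss = criterion(logits, labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
</pre>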
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how linearly separable the representations at each layer are.<br />
According to the results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is used in this experiment.<br />
<br />
[[File:histo.png|500px|center]]<br />
<br />
== Conclusion ==<br />
<br />
This paper reveals that, if strong data augmentation is employed, as little as a single image is sufficient for self-supervision techniques to learn the first few layers of popular CNNs. However, even millions of images are not enough for learning the deeper layers, and supervision might still be necessary. The results confirm that the weights of the first layers of deep networks contain only limited information about natural images. Accordingly, current unsupervised learning of the early layers amounts largely to exploiting augmentation, and we probably do not yet use the full capacity of millions of images.<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019.<br />
<br />
[2] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149.<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,” arXiv preprint arXiv:1610.01644, 2016.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=46978CRITICAL ANALYSIS OF SELF-SUPERVISION2020-11-27T04:24:54Z<p>Mrasooli: /* References */</p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluates how well state-of-the-art self-supervised methods learn the weights of convolutional neural networks (CNNs), assessing the learned features on a per-layer basis. It also aims to determine whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of the vast amount of unlabeled data available for training CNNs and to find a generalized image representation. <br />
In self-supervised learning, unlabeled data provide their own ground-truth labels through pretext tasks such as rotation estimation. For example, given a picture of a cat without the label "cat", we rotate the image by 90 degrees clockwise and train the CNN to predict the rotation that was applied [3].<br />
<br />
[[File:unsupervised.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In the recent literature, several papers have addressed self-supervised learning methods and learning from a single sample.<br />
<br />
A BiGAN [2], or Bidirectional GAN, is a generative adversarial network augmented with an encoder. The generator maps latent samples to generated data, and the encoder learns the inverse mapping from data back to the latent space. After training a BiGAN, the encoder has learned a rich image representation. In the RotNet method [3], images are rotated and the CNN learns to predict the rotation that was applied. DeepCluster [4] alternates between k-means clustering of the features and training the network on the resulting cluster assignments, which yields feature representations that are stable under several image transformations.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, the authors compare training on a million images from the ImageNet dataset with training on a million augmented images generated from one single image. Various data augmentation methods, including cropping, rotation, scaling, contrast changes, and adding noise, are used to generate this artificial dataset from the single image. <br />
<br />
To measure the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes [5] are commonly used to inspect how much information each layer has learned. Since the main purpose of a CNN classifier is to reach a linearly separable representation of images, the linear probing technique evaluates the training of a CNN based on whether the features at each layer are linearly discriminable.<br />
The same experiment is repeated using the CIFAR-10/100 datasets.<br />
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how linearly separable the representations at each layer are.<br />
According to the results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is used in this experiment.<br />
<br />
[[File:histo.png|500px|center]]<br />
<br />
== Conclusion ==<br />
<br />
This paper reveals that, if strong data augmentation is employed, as little as a single image is sufficient for self-supervision techniques to learn the first few layers of popular CNNs. However, even millions of images are not enough for learning the deeper layers, and supervision might still be necessary. The results confirm that the weights of the first layers of deep networks contain only limited information about natural images. Accordingly, current unsupervised learning of the early layers amounts largely to exploiting augmentation, and we probably do not yet use the full capacity of millions of images.<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019.<br />
<br />
[2] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149.<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,” arXiv preprint arXiv:1610.01644, 2016.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=46977CRITICAL ANALYSIS OF SELF-SUPERVISION2020-11-27T04:24:44Z<p>Mrasooli: /* Method & Experiment */</p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluates how well state-of-the-art self-supervised methods learn the weights of convolutional neural networks (CNNs), assessing the learned features on a per-layer basis. It also aims to determine whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of the vast amount of unlabeled data available for training CNNs and to find a generalized image representation. <br />
In self-supervised learning, unlabeled data provide their own ground-truth labels through pretext tasks such as rotation estimation. For example, given a picture of a cat without the label "cat", we rotate the image by 90 degrees clockwise and train the CNN to predict the rotation that was applied [3].<br />
<br />
[[File:unsupervised.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In the recent literature, several papers have addressed self-supervised learning methods and learning from a single sample.<br />
<br />
A BiGAN [2], or Bidirectional GAN, is a generative adversarial network augmented with an encoder. The generator maps latent samples to generated data, and the encoder learns the inverse mapping from data back to the latent space. After training a BiGAN, the encoder has learned a rich image representation. In the RotNet method [3], images are rotated and the CNN learns to predict the rotation that was applied. DeepCluster [4] alternates between k-means clustering of the features and training the network on the resulting cluster assignments, which yields feature representations that are stable under several image transformations.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, the authors compare training on a million images from the ImageNet dataset with training on a million augmented images generated from one single image. Various data augmentation methods, including cropping, rotation, scaling, contrast changes, and adding noise, are used to generate this artificial dataset from the single image. <br />
<br />
To measure the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to inspect how much information each layer has learned. Since the main purpose of a CNN classifier is to reach a linearly separable representation of images, the linear probing technique evaluates the training of a CNN based on whether the features at each layer are linearly discriminable.<br />
The same experiment is repeated using the CIFAR-10/100 datasets.<br />
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how linearly separable the representations at each layer are.<br />
According to the results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is used in this experiment.<br />
<br />
[[File:histo.png|500px|center]]<br />
<br />
== Conclusion ==<br />
<br />
This paper reveals that, if strong data augmentation is employed, as little as a single image is sufficient for self-supervision techniques to learn the first few layers of popular CNNs. However, even millions of images are not enough for learning the deeper layers, and supervision might still be necessary. The results confirm that the weights of the first layers of deep networks contain only limited information about natural images. Accordingly, current unsupervised learning of the early layers amounts largely to exploiting augmentation, and we probably do not yet use the full capacity of millions of images.<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019.<br />
<br />
[2] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149.</div>Mrasooli