Adversarial Fisher Vectors for Unsupervised Representation Learning
<hr />
== Presented by ==<br />
Sobhan Hemati<br />
<br />
== Introduction ==<br />
<br />
Generative adversarial networks (GANs) are among the most important generative models, in which a discriminator and a generator compete with each other in a minimax game. Based on the original GAN paper, when training is finished and a Nash Equilibrium is reached, the discriminator is nothing but a constant function that assigns a score of 0.5 everywhere. This means that in this setting the discriminator is nothing more than a tool for training the generator. Furthermore, the generator in a traditional GAN models the data density only implicitly, while in some applications we need an explicit generative model of the data. Recently, it has been shown that training an energy-based model (EBM) with a parameterized variational distribution is also a minimax game similar to the one in GANs. Although the two formulations are similar, an advantage of the EBM view is that, unlike the original GAN formulation, the discriminator itself is an explicit density model of the data.<br />
<br />
With these remarks in mind, the authors of this paper show that an energy-based model can be trained using a minimax formulation similar to that of GANs. After training the energy-based model, they use the Fisher Score and Fisher Information (which are calculated from the derivatives of the generative model w.r.t. its parameters) to evaluate the power of the discriminator in modeling the data distribution. More precisely, they calculate normalized Fisher Vectors and a Fisher Distance measure using the discriminator's derivatives to estimate similarities both between individual data samples and between sets of samples. They name these derived representations Adversarial Fisher Vectors (AFVs). The Fisher Vector is a powerful representation that can be computed from this EBM thanks to the fact that the discriminator itself is an explicit density model of the data. Fisher Vectors can also be used for set representation problems, which are challenging in general: as we will see, the Fisher kernel can be used to calculate the distance between two sets of images, which is not a trivial task. The authors find several applications and attractive characteristics for AFVs as pre-trained features, such as:<br />
<br />
* State-of-the-art performance for unsupervised feature extraction and linear classification tasks.<br />
* Using the similarity function induced by the learned density model as a perceptual metric that correlates well with human judgments.<br />
* Improved training of GANs, through monitoring (AFV metrics) and improved stability (MCMC-style generator updates), which is difficult in general.<br />
* Using AFV to estimate the distance between sets which allows monitoring the training process. More precisely, the Fisher Distance between the set of validation examples and generated examples can effectively capture the existence of overfitting.<br />
<br />
== Background == <br />
===Generative Adversarial Networks===<br />
GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated). The weights of generator and discriminator are updated by solving the following optimization problem:<br />
\begin{equation}<br />
\underset{G}{\text{max}} \ \underset{D}{\text{min}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[-\log (1-D(G(\mathbf{z})))]<br />
\tag{1}<br />
\label{1}<br />
\end{equation}<br />
<br />
where <math> p_{data(\mathbf{x})} </math>, <math> D(\mathbf{x}) </math>, and <math> G(\mathbf{z}) </math> are the data distribution, the discriminator, and the generator, respectively. To solve the above problem, in the inner loop <math> D </math> is trained until convergence given <math> G </math>, and in the outer loop <math> G </math> is updated one step given <math> D </math>.<br />
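<br />
To make the alternating optimization in Eq. \ref{1} concrete, below is a minimal PyTorch sketch (ours, not code from the paper) on 2-D toy data: a few inner <math> D </math> steps stand in for training <math> D </math> until convergence, and the <math> G </math> step uses the common non-saturating variant of the objective.<br />
<pre>
# Minimal GAN training sketch (illustrative only): alternating D and G updates.
import torch
import torch.nn as nn

D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
bce = nn.BCELoss()

def sample_data(n=128):                      # stand-in for p_data: a shifted Gaussian blob
    return torch.randn(n, 2) * 0.5 + torch.tensor([2.0, 0.0])

for step in range(1000):
    # inner loop: several D updates with G fixed (approximating "train D until convergence")
    for _ in range(3):
        x, z = sample_data(), torch.randn(128, 8)
        loss_D = bce(D(x), torch.ones(128, 1)) + bce(D(G(z).detach()), torch.zeros(128, 1))
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # outer loop: one G update with D fixed (non-saturating generator loss)
    z = torch.randn(128, 8)
    loss_G = bce(D(G(z)), torch.ones(128, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
</pre>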
<br />
===GANs as variational training of deep EBMs===<br />
An energy-based model (EBM) is a form of generative model that learns the characteristics of a target dataset by assigning high density (low energy) to points that resemble the data, and it can be used to generate new samples with a similar distribution. Let an energy-based model define a density function <math> p_{E}(\mathbf{x}) </math> as <math> \frac{e^{-E(\mathbf{x})}}{ \int_{\mathbf{x}} e^{-E(\mathbf{x})} \,d\mathbf{x} } </math>. Then, the negative log-likelihood (NLL) of <math> p_{E}(\mathbf{x}) </math> can be written as<br />
<br />
\begin{equation}<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log \int_{\mathbf{x}} q(\mathbf{x}) \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}\,d\mathbf{x} =<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log E_{\mathbf{x} \sim q(\mathbf{x})}[\frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}] \geq \\<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ E_{\mathbf{x} \sim q(\mathbf{x})}[\log \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}]= E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]- E_{\mathbf{x} \sim q(\mathbf{x})}[E(\mathbf{x})] + H(q)<br />
\tag{2}<br />
\label{2}<br />
\end{equation}<br />
<br />
where <math> q(x) </math> is an auxiliary distribution called the variational distribution and <math>H(q) </math> denotes its entropy. Here Jensen’s inequality was used to obtain the variational lower bound on the NLL. This bound is tight if <math> q(x) \propto e^{-E(\mathbf{x})} \ \forall \mathbf{x}, </math> which means <math> q(x) = p_{E}(\mathbf{x}) </math>. In this case, if we put <math> D(\mathbf{x})= -E(\mathbf{x}) </math> and also <math> q(x)= p_{G}(\mathbf{x}) </math>, Eq.\ref{2} turns into the following problem:<br />
<br />
<br />
<br />
\begin{equation}<br />
\underset{D}{\text{min}} \ \underset{G}{\text{max}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[D(G(\mathbf{z}))] +H(p_{G})<br />
\tag{3}<br />
\label{3}<br />
\end{equation}<br />
<br />
<br />
where, in each iteration, the variational lower bound is first maximized w.r.t. <math> p_{G}</math>; the energy model is then updated one step to decrease the NLL with the optimal <math> p_{G}</math> (see Figure 1). [[File:Fig1.png]]<br />
<br />
Equations \ref{3} and \ref{1} are similar in the sense that both take the form of a minimax game between <math> D </math> and <math> G </math>. However, there are 3 major differences:<br />
<br />
*The entropy regularization term <math> H(p_{G})</math> in Eq. \ref{3} prevents the generator from collapsing (although in practice, it is difficult to come up with a differentiable approximation to the entropy term <math> H(p_{G})</math> and instead heuristic regularization methods such as Batch Normalization are used).<br />
* The order of optimizing <math> D </math> and <math> G </math> is different.<br />
* More importantly, <math> D </math> is a density model for the data distribution and <math> G </math> learns to sample from <math> D </math>.<br />
<br />
== Methodology==<br />
===Adversarial Fisher Vectors===<br />
As mentioned above, one of the most important advantages of an EBM GAN compared with a traditional one is that the discriminator is a dual form of the generator. This means that the discriminator defines a distribution that matches the training data. There is generally a straightforward way to evaluate the quality of the generator: inspect the quality of the produced samples. However, when it comes to the discriminator, it is not clear how to evaluate or use a discriminator trained in a minimax scheme. To evaluate and also employ the discriminator of the GAN, the authors propose to use the theory of Fisher Information. This theory was originally proposed with the motivation of making connections between two different types of models used in machine learning, i.e., generative and discriminative models. Given a density model <math> p_{\theta}(\mathbf{x})</math>, where <math> \mathbf{x} \in R^d </math> is the input and <math> \theta </math> are the model parameters, the Fisher Score of an example <math> \mathbf{x} </math> is defined as <math> U_\mathbf{x}=\nabla_{\theta} \log p_{\theta}(\mathbf{x}) </math>. This gradient maps an example <math> \mathbf{x} </math> into a feature vector that is a point in the gradient space of the model manifold. Intuitively, <math> U_\mathbf{x} </math> gives the direction of steepest ascent in <math> \log p_{\theta}(\mathbf{x}) </math> for the example <math> \mathbf{x} </math> along the manifold; in other words, the Fisher<br />
Score encodes the desired change of model parameters to better fit the example <math> \mathbf{x} </math>. The Fisher Information is defined as <math> I=E_{\mathbf{x} \sim p_{\theta}(\mathbf{x})} [U_\mathbf{x} U_\mathbf{x}^T]</math>. Given the Fisher Information and the Fisher Score, one can map an example <math> \mathbf{x} </math> from feature space to the model (parameter) space, and measure the proximity between two examples <math> \mathbf{x} </math> and <math> \mathbf{y} </math> by <math> U_\mathbf{x}^T I^{-1} U_\mathbf{y}</math>. The metric distance based on this proximity is defined as <math> (U_\mathbf{x}-U_\mathbf{y})^T I^{-1} (U_\mathbf{x}-U_\mathbf{y})</math>. This metric is called the Fisher Distance and can easily be generalized to measure the distance between two sets. Finally, the Adversarial Fisher Vector (AFV) is defined as<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=I^{-\frac{1}{2}}U_\mathbf{x}<br />
\end{equation}<br />
<br />
As a result, the Fisher Distance is equivalent to the Euclidean distance between AFVs. Fisher Vector theory has traditionally been applied with simple generative models such as Gaussian mixture models (GMMs).<br />
In the domain of EBMs, where the density model is parameterized as <math> p_\theta(\mathbf{x})= \frac{e^{D(\mathbf{x},\theta)}}{\int_{\mathbf{x}} e^{D(\mathbf{x},\theta)} \,d\mathbf{x}} </math> (the discriminator plays the role of the negative energy) and <math> \theta </math> are the parameters of <math> D</math>, the Fisher Score is derived as<br />
<br />
<br />
<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - \nabla_{\theta} \log \int_{\mathbf{x}} e^{D(\mathbf{x},\theta)} \,d\mathbf{x}= \nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{x} \sim p_\theta(\mathbf{x})} \nabla_{\theta} D(\mathbf{x};\theta).<br />
\tag{4}<br />
\label{4}<br />
\end{equation}<br />
As we know, in an EBM GAN the generator is updated during training to match the distribution <math> p_G(\mathbf{x}) </math> to <math> p_\theta(\mathbf{x})</math>. This allows us to approximate the second term in Eq.\ref{4} by sampling from the generator's distribution, which lets us compute the Fisher Score and Fisher Information in an EBM GAN as follows:<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{z} \sim p(\mathbf{z})} \nabla_{\theta} D(G(\mathbf{z});\theta), \quad I= E_{\mathbf{z} \sim p(\mathbf{z})}[U_{G(\mathbf{z})} U^T_{G(\mathbf{z})}]<br />
\tag{5}<br />
\label{5}<br />
\end{equation}<br />
<br />
Finally, having the Fisher Score and the Fisher Information, the following approximation is used to calculate the AFV:<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=\mbox{diag}(I)^{-\frac{1}{2}}U_\mathbf{x}<br />
\tag{6}<br />
\label{6}<br />
\end{equation}<br />
<br />
Remember that the Fisher Score transforms data from feature space to parameter space, which means that the dimensionality of the vectors can easily be in the millions. As a result, replacing <math> I </math> with <math>\mbox{diag}(I) </math> is an attempt to reduce the computational cost of calculating the final AFV.<br />
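<br />
As a rough illustration of Eqs. \ref{5} and \ref{6} (our sketch, not the authors' code), the following assumes a discriminator <math> D </math> that outputs a single unnormalized score per input and a generator <math> G </math> that maps noise vectors to samples; per-example gradient loops are used for clarity rather than efficiency.<br />
<pre>
# Sketch of AFV computation: Fisher Scores from parameter gradients of D,
# a baseline estimated from generator samples, and a diagonal Fisher normalization.
import torch

def param_grad(D, x):
    """Flattened gradient of D(x) w.r.t. the parameters of D, for one example x."""
    grads = torch.autograd.grad(D(x.unsqueeze(0)).sum(), list(D.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def adversarial_fisher_vectors(D, G, xs, n_samples=256, z_dim=8, eps=1e-8):
    # Baseline and diagonal Fisher Information estimated from generated samples (Eq. 5).
    gen_grads = torch.stack([param_grad(D, G(torch.randn(z_dim))) for _ in range(n_samples)])
    baseline = gen_grads.mean(dim=0)
    U_gen = gen_grads - baseline                 # Fisher Scores of generated samples
    diag_I = (U_gen ** 2).mean(dim=0)            # diagonal of E_z[U U^T]
    # Fisher Scores and normalized AFVs for the real examples (Eq. 6).
    U_x = torch.stack([param_grad(D, x) for x in xs]) - baseline
    return U_x / torch.sqrt(diag_I + eps)
</pre>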
<br />
===Generator update as stochastic gradient MCMC===<br />
The use of a generator provides an efficient way of drawing samples from the EBM. However, in practice, great care needs to be taken to make sure that G is well conditioned to produce examples that cover enough modes of D. There is also a related issue where the parameters of G will occasionally undergo sudden changes, generating samples drastically different from iteration to iteration, which contributes to training instability and lower model quality.<br />
<br />
In light of these issues, the authors provide a different treatment of G, borrowing inspiration from the Markov chain Monte Carlo (MCMC) literature. MCMC variants have been widely studied in the context of EBMs, where they are used to sample from an unnormalized density and approximate the partition function. Stochastic gradient MCMC is of particular interest as it utilizes the gradient of the log probability w.r.t. the input and performs gradient ascent to incrementally update the samples (while adding noise to the gradients); it has recently been applied to deep EBMs. The authors speculate that it is possible to train G to mimic the stochastic gradient MCMC update rule, such that the samples produced by G will approximate the true model distribution.<br />
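<br />
A hedged sketch of one such stochastic gradient MCMC refinement (a Langevin-style update on the samples, not necessarily the exact procedure used in the paper) is given below.<br />
<pre>
# Samples are refined by gradient ascent on the discriminator score D(x)
# (the log unnormalized density) with Gaussian noise added at each step.
import torch

def langevin_refine(D, x, n_steps=10, step_size=0.01, noise_scale=0.005):
    x = x.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        score = D(x).sum()                              # sum over the batch of scalar scores
        grad = torch.autograd.grad(score, x)[0]         # gradient w.r.t. the samples themselves
        x = (x + step_size * grad + noise_scale * torch.randn_like(x)).detach().requires_grad_(True)
    return x.detach()
</pre>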
<br />
== Experiments ==<br />
===Evaluating AFV representations===<br />
As pointed out earlier, the main advantage of EBM GANs is their powerful discriminator, which can learn a density function characterizing the data manifold of the training data. To evaluate how well the discriminator learns the data distribution, the authors used Fisher Information theory: they trained models under different settings and employed the discriminator to extract AFVs, which were then used as features for an unsupervised-pretraining classification task.<br />
Results in Table 1 suggest that AFVs achieve state-of-the-art performance among unsupervised pretraining methods on this classification task and are comparable with supervised learning.<br />
<br />
[[File:Table1.png||center]]<br />
<br />
AFVs can also be used to measure the distance between sets of data points. The authors took advantage of this and calculated the semantic distance between classes (all data points of each class) in CIFAR-10. As shown in Figure 2, although training was unsupervised, the semantic relations between classes are well estimated; for example, cars are most similar to trucks and dogs to cats.<br />
<br />
[[File:Sobhan_Fig2.jpg||center]]<br />
<br />
<br />
As AFVs transform data from the feature space to the parameter space of the generative model and, as a result, carry information about the data manifold, they are also expected to carry additional fine-grained perceptual information. To evaluate this, the authors ran experiments to examine the usefulness of AFVs as a perceptual similarity metric consistent with human judgments. They used the AFV representation to calculate distances between image patches and compared it with current methods on the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset using the 2AFC and Just Noticeable Difference (JND) metrics. They trained a GAN on ImageNet and then calculated AFVs on the BAPPS evaluation set.<br />
Table 2 shows the performance of AFV along with a variety of existing benchmarks. Clearly, AFV exceeds the reported unsupervised and self-supervised methods and is competitive with supervised methods trained on ImageNet.<br />
<br />
[[File:Sobhan_Table2.png||center]]<br />
<br />
An interesting point about AFVs is their robustness to overfitting. The dimensionality of AFVs is 3 orders of magnitude higher than that of existing methods, which would typically bring a higher propensity to overfit. However, AFVs still show strong generalization ability, demonstrating that they indeed encode a meaningful low-dimensional subspace of the original data. Figure 6 visualizes the nearest neighbours.<br />
<br />
[[File:Sobhan_Fig_6.png||center]]<br />
<br />
===Using the Fisher Distance to monitor training===<br />
Training GANs is a challenging task, partly because of the lack of reliable metrics. Although some domain-specific metrics such as the Inception Score and the Fréchet Inception Distance have recently been proposed, they mainly rely on a discriminative model trained on ImageNet and thus have limited<br />
applicability to datasets that are drastically different. In this paper, the authors use the Fisher Distance between the set of real and the set of generated examples to monitor and diagnose the training process. To do this, they conducted a set of experiments on CIFAR-10, varying the number of training examples over the set {1000, 5000, 25000, 50000}. Figure 3 shows the batch-wise estimate of the Inception Score and the "Fisher Similarity". It is clear that for larger numbers of training examples, the validation Fisher Similarity steadily increases, following a trend similar to the Inception Score. On the other hand, when the number of training examples is small, the validation Fisher Similarity starts decreasing at some point, revealing overfitting.<br />
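<br />
Below is a rough sketch of how such a set-level comparison could be computed from AFVs; the exact definition of the "Fisher Similarity" plotted in the paper may differ (here we simply assume statistics of the mean AFV of each set).<br />
<pre>
# Assumed illustration: compare two sets of examples (e.g., validation vs. generated)
# through the Euclidean distance / cosine similarity of their mean AFVs.
import torch

def set_fisher_distance(afv_a, afv_b):
    """afv_a, afv_b: (n_examples, dim) matrices of per-example AFVs."""
    return torch.norm(afv_a.mean(dim=0) - afv_b.mean(dim=0))

def set_fisher_similarity(afv_a, afv_b):
    return torch.nn.functional.cosine_similarity(afv_a.mean(dim=0), afv_b.mean(dim=0), dim=0)
</pre>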
<br />
[[File:Sobhan_Fig_3.png||center]]<br />
<br />
<br />
===Interpreting G update as parameterized MCMC===<br />
AFVs can only be applied if the generator approximates the EBM well during the training process. To check this, a model is trained on ImageNet at 64×64 resolution, with the default architecture modified by adding residual blocks to the discriminator and the generator. The following figure shows training statistics over 80,000 iterations.<br />
<br />
[[File:training 80K.png|600px|center]]<br />
<div align="center">Left: default generator objective. Right: corresponding Inception scores.</div><br />
<br />
== Conclusion ==<br />
In this paper, the authors demonstrated that GANs can be reinterpreted in order to learn representations across a diverse set of tasks without requiring domain knowledge or annotated data. The authors also showed that in an EBM GAN, the discriminator can explicitly learn the data distribution and capture the intrinsic manifold of the data with a low error rate. This is different from regular GANs, where the discriminator is reduced to a constant function once the Nash Equilibrium is reached. To evaluate how well the discriminator estimates the data distribution, the authors took advantage of Fisher Information theory. First, they showed that AFVs are a reliable indicator of whether GAN<br />
training is well behaved, and that we can use this monitoring to select good model checkpoints. Second, they illustrated that AFVs are a useful feature representation for linear and nearest neighbour classification, achieving state-of-the-art among unsupervised feature representations and competitive with supervised results on CIFAR-10. <br />
Finally, they showed that a well-trained GAN discriminator does contain useful information for fine-grained perceptual similarity suggesting that AFVs are good candidates for image search. All in all, the conducted experiments show the effectiveness of the EBM GANs coupled with the Fisher Information framework for extracting useful representational features from GANs. <br />
As future work, the authors propose to improve the scalability of the AFV method by compressing the Fisher Vector representation, using methods like product quantization.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/apple/ml-afv Adversarial Fisher Vectors].<br />
<br />
== Critique == <br />
<br />
This paper makes an excellent contribution to feature representation by exploiting information theory and GANs. However, it lacks an intuitive explanation of the defined formulas and of why this representation performs so well in classification tasks. An "Analysis" section would therefore make the paper more readable and understandable.<br />
<br />
== References==<br />
<br />
Jaakkola, Tommi, and David Haussler. "Exploiting generative models in discriminative classifiers." Advances in neural information processing systems. 1999.<br />
<br />
Zhai, Shuangfei, et al. "Adversarial Fisher Vectors for Unsupervised Representation Learning." Advances in Neural Information Processing Systems. 2019.<br />
<br />
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." 2007 IEEE conference on computer vision and pattern recognition. IEEE, 2007.<br />
<br />
Sánchez, Jorge, et al. "Image classification with the fisher vector: Theory and practice." International journal of computer vision 105.3 (2013): 222-245.
<br />
STAT946F20/BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
<hr />
== Presented by == <br />
Wenyu Shen<br />
<br />
== Introduction == <br />
This paper introduces the structure of the BERT model. The full name of BERT is Bidirectional Encoder Representations from Transformers, and this language model broke records on eleven natural language processing tasks. BERT advanced the state of the art for pre-training of contextual representations. One novel feature compared to Word2Vec or GloVe is BERT's ability to produce different representations for the same word in different contexts. To elaborate, Word2Vec always creates the same embedding for a given word regardless of the words that precede and follow it, whereas BERT generates different embeddings depending on the surrounding context. This is useful because words can have homonyms, such as "bank", which could refer to a "financial institution" or to the "land alongside or sloping down to a river or lake".<br />
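<br />
As an illustration of this contextual behaviour, the following sketch uses the Hugging Face transformers library (not part of the original paper) to extract the embedding of "bank" in two different sentences; the two vectors differ because the surrounding context differs.<br />
<pre>
# Requires: pip install torch transformers
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentences = ["He deposited cash at the bank.", "They walked along the river bank."]
with torch.no_grad():
    for s in sentences:
        inputs = tokenizer(s, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]            # (seq_len, 768) contextual vectors
        idx = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
        print(s, hidden[idx][:5])                                # first few dimensions of the "bank" vector
</pre>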
<br />
== Transformer and BERT == <br />
Let us start with an introduction to the encoder-decoder model, which, as seen in class, is applied to seq2seq problems. In a seq2seq problem, given an input sequence x, the encoder-decoder model generates an output sequence y based on x (as in translation or question-answering systems). However, when an RNN or a similar model is used as the basic architecture of the encoder-decoder, performance can suffer when the input sequence is too long. Although an encoder-decoder with attention, which does not merge all the encoder outputs into a single context vector, mitigates this, the paper Attention Is All You Need [1] introduced a framework that uses only attention in the encoder-decoder to perform machine translation. The Transformer utilizes Scaled Dot-Product Attention and a sequential mask in the decoder, and typically performs multi-head attention to derive features from different representation subspaces for each token. The Transformer also adds a positional encoding, which has the same dimension as the word embedding, to capture the order of the inputs. BERT is built from a stack of N Transformer encoder units. <br />
<br />
[[File:Transformer Structure.png | center |800px]]<br />
<br />
<div align="center">Table 1: Transformer Structure </div><br />
<br />
== BERT ==<br />
BERT works well with both the feature-based and the fine-tuning approaches. Both approaches start with unsupervised pre-training on a source corpus A. The feature-based approach then keeps the pre-trained parameters fixed and uses a labeled source B to train a task-specific model on top of the extracted features, while the fine-tuning approach tunes all parameters when training on the downstream task. BERT is designed around the fine-tuning approach. Earlier Transformer language models read text only from left to right. A deep bidirectional model is strictly more powerful than a left-to-right model, or even than the concatenation of a left-to-right and a right-to-left model; however, naive bidirectional conditioning would allow each word to see itself indirectly, which makes the prediction problem trivial. Therefore, BERT uses a masked language model (MLM) objective to pre-train deep bidirectional Transformers. In this pretraining method, some random tokens are masked each time, and the model's objective is to predict the vocabulary id of each masked token based on both its left and its right context. BERT also performs a Next Sentence Prediction (NSP) task to make the model understand the relationship between sentences. In the NSP task, two sentences A and B are fed to the network to predict whether they are consecutive or not; these sentence pairs in the training data are 50% of the time consecutive (labeled as IsNext) and 50% of the time random sentences from the corpus (labeled as NotNext). In addition, the input representation sums Token Embeddings, Segment Embeddings, and Position Embeddings, which lets BERT accommodate a variety of downstream tasks. Finally, the tokens selected for the MLM objective are not always replaced by the [MASK] token; this reduces the mismatch between pre-training (where [MASK] appears) and fine-tuning (where it does not). Concretely, of the 15% of tokens selected for prediction, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% remain unchanged; a sketch of this procedure is shown below. <br />
[[File:Token embedding.png | center | 800px]]<br />
<br />
<div align="center">Table 2: Token embedding</div><br />
<br />
== Applications ==<br />
<br />
As previously mentioned, BERT has achieved state-of-the-art performance on eleven NLP tasks. BERT can also be trained on different corpora, as seen in Figure 1, and then different pre-training and fine-tuning can be applied downstream; this landscape is surely not exhaustive. This illustrates the wide range of applications for which BERT can be retrained.<br />
<br />
[[File:application_landscape.png| center |1000px|Image: 1000 pixels]]<br />
<br />
<div align="center">Figure 1: Landscape of BERT Applications</div><br />
<br />
== Comparison between ELMo, GPT, and BERT ==<br />
In this section, we will compare BERT with previous language models, particularly ELMo and GPT. These three models are among the biggest advancements in NLP. ELMo is a bi-directional LSTM model and is able to capture context information from both directions. It's a feature-based approach, which means the pre-trained representations are used as features. GPT and BERT are both transformer-based models. GPT only uses transformer decoders and is unidirectional. This means information only flows from the left to the right in GPT. In contrast, BERT only uses transformer encoders and is bidirectional. Therefore, it can capture more context information than GPT and tends to perform better when context information from both sides is important. GPT and BERT are fine-tuning-based approaches. Users can use the models on downstream tasks by simply fine-tuning model parameters.<br />
<br />
[[File:comparison_paper5.png | center |800px]]<br />
By looking at the above picture, we can better understand the comparison between these three models. As mentioned above, GPT is unidirectional, which means the layers are not dense and only weights from left to right are present. BERT is bidirectional in the sense that weights from left to right and from right to left are both present (the layers are dense). ELMo is also bidirectional, but not in the same way as BERT: it uses a concatenation of independently trained left-to-right and right-to-left LSTMs. Note that among these three models, only BERT's representations are jointly conditioned on context from both directions in all layers.<br />
<br />
== Conclusion ==<br />
<br />
Consequently, BERT is a powerful model pre-trained on large amounts of unlabeled data, and it is particularly useful when we want to perform NLP tasks with only a small amount of labeled data.<br />
<br />
<br />
[[File:Result.png | center |800px]]<br />
<br />
<div align="center">Table 3: Performance of BERT in multiple datasets</div><br />
<br />
<br />
== Critique ==<br />
BERT showed that Transformers can be a good architecture for solving downstream NLP tasks, but the authors did not focus on carefully choosing hyperparameters or training and pre-training procedures. As ALBERT [3] and RoBERTa [4] have shown, with better hyperparameters and training choices, similar or even better performance can be achieved with less time and training data.<br />
<br />
== Repository ==<br />
<br />
A github repository for BERT is available at <span class="plainlinks">[https://github.com/brightmart/bert_language_understanding "official repository"]</span><br />
<br />
== Fun facts ==<br />
<br />
A collection of BERT-related papers published in 2019. The y-axis is the log of the citation count (based on Google Scholar).<br />
[[File:BERT-related.gif|800px|center]]<br />
<br />
== References ==<br />
[1] Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin.<br />
"Attention Is All You Need". (2017)<br />
<br />
[2] <br />
Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". (2019)<br />
<br />
[3] Lan, Zhenzhong, et al. "Albert: A lite bert for self-supervised learning of language representations." arXiv preprint arXiv:1909.11942 (2019).<br />
[4] Liu, Yinhan, et al. "Roberta: A robustly optimized bert pretraining approach." arXiv preprint arXiv:1907.11692 (2019).
<br />
When Does Self-Supervision Improve Few-Shot Learning?
<hr />
== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper proposes a technique utilizing self-supervised learning (SSL) to improve the generalization of few-shot learned representations on small labeled data sets. <br />
<br />
Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in the hope of successfully classifying previously unseen, but related classes. <br />
<br />
Self-supervised learning aims to teach the agent the internal structure of images by giving it tasks such as predicting the degree of rotation applied to an image. The following image illustrates rotation prediction as a proxy task in self-supervision. The proposed method can help with generalization issues where the agent cannot distinguish between newly introduced objects. Self-supervision is a powerful method for taking advantage of the vast amount of unlabeled data.<br />
<br />
[[File:rotation prediction 22.png|500px|center]]<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where the aim is to learn general representations so that, when facing novel classes, the agent can differentiate between them after training on just a few samples. Many few-shot learning methods exist; this paper focuses on Prototypical Networks, or ProtoNets [1] for short. There is also a section of the paper that compares this model with the model-agnostic meta-learner (MAML) [2]. [note 1]<br />
<br />
<br />
The other machine learning technique that this paper builds on is self-supervised learning. In this technique, unlabelled data is utilized, which avoids the cost of labeling and maintaining a massive data set. Images already contain structural information that can be exploited. Many SSL tasks exist, such as removing a part of the data and asking the agent to reconstruct the missing part. Other tasks include predicting rotations, relative patch locations, etc.<br />
<br />
The work in this paper is also related to multi-task learning, in which training proceeds on multiple tasks concurrently so that they improve each other. Training on multiple tasks is known to degrade the performance on individual tasks [3], and it seems to work only for very specific combinations and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning is mutually beneficial. This has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning.<br />
<br />
In this framework, a feed-forward convolutional network <math>f(x)</math> maps either a labeled image or an augmented unlabelled image to an embedding space. Depending on the input type, the embedding is then mapped to one of two label spaces, by either a classifier <math>g</math> or a function <math>h</math>. When evaluating the accuracy of the model, only the mappings of labelled images by the classifier <math>g</math> are considered, whereas when training the model, the mappings of both labelled and unlabelled images by <math>g</math> and <math>h</math> respectively are utilized. <br />
The labelled training data consists of a set of base classes given as pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the unlabelled images used for the self-supervised tasks is denoted by <math>\mathcal{D}_{ss}</math>; augmentations are applied to the images in this domain. The authors consider the augmentation types of jigsaw puzzles and rotations. They also compare the effect on accuracy of having the unlabelled image be an augmentation of the input labelled image (i.e., <math>\mathcal{D}_s = \mathcal{D}_{ss}</math>) versus having it be an augmentation of a different image (i.e., <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>). <br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Combining supervised and self-supervised losses for few-shot learning. . This paper investigates how the performance on the supervised learning task is influenced by the the choice of the self-supervision task.</div><br />
<br />
The training procedure consists of mapping a labelled image and an unlabelled augmented image to separate embeddings using the shared feature backbone <math>f</math> of the feed-forward convolutional network. The model is then trained using a loss function <math>\mathcal{L}</math> which combines a classification loss term <math>\mathcal{L}_s</math> involving the labelled image embedding and a self-supervised loss term <math>\mathcal{L}_{ss}</math> involving the unlabelled augmented image embedding.<br />
<br />
The classification loss <math>\mathcal{L}_s</math> is defined as:<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
Where it is common to use cross-entropy loss for the loss function, <math> \ell </math>, and <math> \ell_2 </math> norm for the regularization, <math> \mathcal{R} </math>.<br />
<br />
The task prediction loss <math>\mathcal{L}_{ss}</math> utilizes a separate function <math>h</math> which maps the embeddings of unlabelled images to a separate label space. Here the target label <math>\hat{y}</math> is related to the augmentation that was applied to the unlabelled image: in the case of a jigsaw puzzle, the label is the index of the permutation applied to the original image, and in the case of a rotation, the label is the angle of rotation applied to the original image. If we define a set of labelled pairs for the previously unlabelled augmented images as <math> \forall x \in \mathcal{D}_{ss}, x \rightarrow (\hat{x}, \hat{y}) </math>, where <math>\hat{x}</math> is the transformed (e.g., rotated or permuted) version of <math>x</math>, then the task prediction loss can be defined as:<br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
<br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised loss acts as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that for the case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on one batch from each dataset, and the two losses are combined.<br />
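<br />
A condensed PyTorch-style sketch of one training step with the combined loss is shown below (our illustration: <math>g</math> stands in for the few-shot classifier, which in the paper is the ProtoNet episodic loss, and <math>h</math> for the self-supervised head).<br />
<pre>
import torch

def training_step(f, g, h, labeled_batch, ssl_batch, optimizer):
    (x, y) = labeled_batch          # images and class labels from D_s
    (x_hat, y_hat) = ssl_batch      # augmented images and SSL targets (rotation index, permutation id, ...) from D_ss
    loss_s = torch.nn.functional.cross_entropy(g(f(x)), y)            # supervised classification term
    loss_ss = torch.nn.functional.cross_entropy(h(f(x_hat)), y_hat)   # self-supervised term (data-dependent regularizer)
    loss = loss_s + loss_ss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</pre>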
<br />
== Experiments ==<br />
To assess the proposed method, several datasets, e.g., Caltech-UCSD Birds, Stanford Cars, FGVC Aircraft, Stanford Dogs, Oxford Flowers, mini-ImageNet, and tiered-ImageNet, have been employed. Each dataset is divided into three disjoint sets: a base set for training the parameters, a val set for validation, and a novel set for testing with a few examples per class, as shown in Figure 2. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
[[File:1.png |center|]]<br />
<br />
<div align="center">Figure 2: Used datasets and their base, validation and test splits.</div><br />
<br />
The authors used a meta-learning method based on prototypical networks, where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part: rotation and the jigsaw puzzle [4]. In the rotation task, the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math> to produce the input, and the target label is the index of the rotation in this list. In the jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles which are then shuffled to produce the input image, and the target is the index of the applied permutation among 35 permutations chosen to be maximally diverse in terms of Hamming distance.<br />
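<br />
As an example, rotation-task inputs and targets can be generated as in the following sketch (the exact preprocessing pipeline in the paper may differ).<br />
<pre>
import torch

def make_rotation_batch(images):                          # images: (B, C, H, W)
    ks = torch.randint(0, 4, (images.size(0),))           # 0, 1, 2, 3  ->  0°, 90°, 180°, 270°
    rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2)) for img, k in zip(images, ks)])
    return rotated, ks                                     # inputs for the SSL head and their target labels
</pre>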
<br />
== Results ==<br />
An N-way k-shot classification task contains N unique classes with k labeled images per class. The results for 5-way 5-shot classification accuracy can be seen in Fig. 3. ProtoNet has been used as the baseline and is compared with the jigsaw task, the rotation task, and both combined. The jigsaw task always improves the results, whereas the rotation task does not provide much improvement on the flowers and aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the rotation task too hard, and planes are usually horizontal, making it too simple.<br />
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 4, SSL is found to be more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 5 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: Performance on few-shot learning using different meta-learners.</div><br />
<br />
Fig. 6 shows the effects of the size and domain of the SSL data on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 6(a) shows how accuracy changes as the percentage of images from the whole dataset used for SSL increases. Increasing the size of the SSL dataset has a positive effect, with diminishing returns. Fig. 6(b) shows the effect of shifting the domain of the SSL dataset by replacing a percentage of its images with images from other datasets. This has a negative effect; moreover, training with SSL on only the 20 percent of the images used for meta-learning is often better than increasing the size while shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
<br />
Figure 7 shows the accuracy of the meta-learner with SSL on different domains as a function of the distance between the supervised domain Ds and the self-supervised domain Dss. Once again we see that the effectiveness of SSL decreases with the distance from the supervised domain across all datasets.<br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 7: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than a network trained with a simple cross-entropy loss on the labels (85.5%).<br />
== Source Codes ==<br />
<br />
The source code can be found here: https://github.com/cvl-umass/fsl_ssl .<br />
== Conclusion ==<br />
The authors of this paper provide great insight into the effects of using SSL as a regularizer for few-shot learning methods. It is shown that SSL is beneficial in almost every case; however, the improvements are much larger on more difficult tasks. It is also shown that the dataset used for SSL does not necessarily need to be large: increasing the size of this dataset can help, but only if the added images are from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the jigsaw puzzle and the rotation task, e.g., counting objects or predicting a removed patch. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with an image size of <math>224\times224</math>, in contrast to other methods that have used an <math>84\times84</math> image size. This gives them a huge advantage; however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
Moreover, in Fig. 8 the authors considered same-domain learning for different examples, and they indicated that adding more unlabeled data from the base classes will increase the accuracy. I would be really curious to apply their approach using cross-domain learning, where the base and novel classes come from very different domains. I believe it might add some robustness and take accuracy to a different level. Also, comparing cross-domain with same-domain learning might add value to their observation that there is not much improvement from the rotation task, especially in the flowers example, as flowers are mostly symmetrical. <br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 8: Comparison with prior works on mini_ImageNet.</div><br />
<br />
I believe that both the strength and the weakness of this paper lie in its experiments. The different experiments compare a variety of self-supervised learning algorithms, which is a good point. However, as the reviewers also pointed out, there are some concerns, including the level of novelty of the work, the way the unlabeled pool is created, and the use of a ResNet-101 pre-trained on ImageNet and mini-ImageNet in their experiments.<br />
<br />
The authors use a multi-task learning approach with self-supervision. However, this approach has already been used in various tasks, e.g., domain adaptation, semi-supervised learning, and training GANs. So, in my opinion, their approach is incremental relative to previous works. Moreover, they showed some quite interesting and even surprising results that may need more consideration, such as Figure 7 in the summary; some of their claims may not match the results.<br />
<br />
== Notes ==<br />
:1. Model-Agnostic Meta-Learning (MAML): Neural networks perform very well at many tasks, but they often require large datasets. In contrast, humans are able to learn new skills from only a few examples. MAML is trained on a collection of different tasks, which play the role of training sets, and is then used to learn new tasks, which play the role of test sets. Therefore, MAML is able to perform well on tasks with small training sets without overfitting to the data.[5]<br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)<br />
<br />
[5]: Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
<br />
THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS
<hr />
<br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, such as molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Although GNNs have recently been proven very effective in many applications, their theoretical properties are not yet well understood.<br />
<br />
The ability of graph neural networks (GNNs) to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test works by constructing a labeling of the nodes of the graph, in an incremental fashion, and then decides whether two graphs are isomorphic by comparing the labelings. This characterization, however, does not settle the issue of which Boolean node classifiers (i.e., functions classifying nodes in graphs as true or false) can be expressed by GNNs. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each graph node by combining it with the aggregate of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. There are AC-GNNs that can reproduce the WL labeling. This does not imply, however, that AC-GNNs can capture every node classifier—that is, every function assigning true or false to every node—that is refined by the WL test. This work aims to answer the question of which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
<br />
== Introduction ==<br />
They tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs called AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors. Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other hand, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple, and there are FOC2 node classifiers that cannot be expressed by AC-GNNs. This leaves us with the following natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that allows expressing all FOC2 classifiers? In this paper, they provide answers to these two questions. <br />
<br />
<br />
The following are the main contributions:<br />
<br />
1. They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000) or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
<br />
2. Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts, where in each layer they also compute a feature vector for the whole graph and combine it with local aggregations; they call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that an ACR-GNN can capture each FOC2 formula.<br />
<br />
They experimentally validate their findings showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when they learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
<br />
== Architecture ==<br />
This paper concentrates on the problem of boolean node classification: given a (simple, undirected) graph G = (V, E) in which each vertex v ∈ V has an associated feature vector xv, the authors aim to classify each graph node as true or false. This paper assumes that these feature vectors are one-hot encodings of node colors in the graph, from a finite set of colors. The neighborhood NG(v) of a node v ∈ V is the set {u | {v, u} ∈ E}. The basic architecture for GNNs, and the one studied in recent studies on GNN expressibility (Morris et al., 2019; Xu et al., 2019), consists of a sequence of layers that combine the feature vectors of every node with the multiset of feature vectors of its neighbors. Formally, let AGG and COM be two sets of aggregation and combination functions. An aggregate-combine GNN (AC-GNN) computes vectors <math>{x_v}^i</math> for every node v of the graph G, via the recursive formula<br />
<br />
[[File:a227-formula.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
where each <math>{x_v}^0</math> is the initial feature vector <math>{x_v}</math> of v. Finally, each node v of G is classified according to a Boolean classification function CLS applied to <math>{x_v}^{(L)}</math>.<br />
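<br />
The following is a minimal sketch of a single AC-GNN layer. It assumes sum aggregation over neighbours and a linear-plus-ReLU combination function; the general definition above leaves AGG and COM unspecified, so treat these choices as illustrative.<br />
<pre>
import torch

def ac_gnn_layer(x, adj, w_self, w_neigh, b):
    """x: (num_nodes, d) node features; adj: (num_nodes, num_nodes) 0/1 adjacency."""
    agg = adj @ x                                          # sum of neighbour features
    return torch.relu(x @ w_self + agg @ w_neigh + b)      # combine self and aggregate

# toy example: a triangle graph with one-hot colour features
x = torch.eye(3)                                           # 3 nodes, 3 colours
adj = torch.tensor([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
w_self, w_neigh, b = torch.randn(3, 4), torch.randn(3, 4), torch.zeros(4)
h = ac_gnn_layer(x, adj, w_self, w_neigh, b)               # x_v^(1) for every node v
cls = (h.sum(dim=1) > 0).long()                            # a stand-in Boolean classifier CLS
</pre>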
<br />
== Concepts ==<br />
=== 1. LOGICAL NODE CLASSIFIER ===<br />
Their study relates the power of GNNs to that of classifiers expressed in first-order (FO) predicate logic over (undirected) graphs where each vertex has a unique color (recall that they call these classifiers logical classifiers). For example,<br />
\[<br />
\alpha(x) := Red(x) \land \exists y (E(x,y) \land Blue(y)) \land \exists z (E(x,z) \land Green(z))<br />
\]<br />
has one free variable, namely <math> x </math>, and two quantified variables <math> y </math> and <math> z </math>. Formally, the authors give the following definition of what it means for a GNN to capture a logical node classifier.<br />
<br />
'''Definition 3.1''' A GNN classifier <math> \mathcal{A} </math> captures a logical classifier <math> \varphi (x) </math> if for every graph G and node v in G, it holds that <math> \mathcal{A}(G,v) = \textrm{true} </math> if and only if <math> (G,v) \models \varphi </math>.<br />
<br />
=== 2. LOGIC FOC2 ===<br />
The logic FOC2 allows formulas using all FO constructs plus counting quantifiers, but restricted to only two variables. In terms of logical expressiveness, FOC2 is strictly less expressive than FO (as counting quantifiers can always be mimicked in FO by using more variables and disequalities), but strictly more expressive than FO2, the fragment of FO that allows formulas to use only two variables (for example, a counting property such as "having at least two blue neighbors" belongs to FOC2 but not to FO2). The authors give the following proposition regarding the choice of the logic FOC2 for measuring the expressiveness of AC-GNNs.<br />
<br />
'''Proposition 3.2''' For any graph G and nodes u,v in G, the WL test colors v and u the same after any number of rounds if and only if u and v are classified the same by all FOC2 classifiers.<br />
<br />
=== 3. FOC2 AND AC-GNN CLASSIFIER ===<br />
While it is true that two nodes are declared indistinguishable by the WL test if and only if they are indistinguishable by all FOC2 classifiers (Proposition 3.2), and if the former holds then such nodes cannot be distinguished by AC-GNNs (Proposition 2.1), this by no means tells us that every FOC2 classifier can be expressed as an AC-GNN. The answer to this problem is covered in the next section.<br />
<br />
=== THE EXPRESSIVE POWER OF AC-GNNS ===<br />
AC-GNNs capture an FOC2 classifier as long as the formula is further restricted to satisfy a certain locality property. This restriction is well known and corresponds to graded modal logic (de Rijke, 2000), which is fundamental for knowledge representation. The idea of graded modal logic is to force all sub-formulas to be guarded by the edge predicate E. This means that one cannot express in graded modal logic arbitrary formulas of the form ∃yϕ(y), i.e., whether some node satisfies property ϕ. Instead, one is allowed to check whether some neighbor y of the node x where the formula is being evaluated satisfies ϕ. That is, one is allowed to express the formula ∃y (E(x, y) ∧ ϕ(y)) in the logic, as in this case ϕ(y) is guarded by E(x, y).<br />
<br />
The relationship between AC-GNNs and graded modal logic goes further: they can show that graded modal logic is the “largest” class of logical classifiers captured by AC-GNNs. This means that the only FO formulas that AC-GNNs are able to learn accurately are those in graded modal logic.<br />
<br />
According to their theorem, a logical classifier is captured by AC-GNNs if and only if it can be expressed in graded modal logic. This holds no matter which aggregate and combine operators are considered, i.e., it is a limitation of the AC-GNN architecture itself rather than of the specific functions chosen to update the features.<br />
<br />
The backward direction of this theorem is witnessed by a simple homogeneous AC-GNN that captures each graded modal logic classifier; the forward direction, as noted above, holds for any choice of aggregation and combination functions.<br />
<br />
=== ACR-GNNs ===<br />
The main shortcoming of AC-GNNs for expressing such classifiers is their local behavior. A natural way to break such a behavior is to allow for a global feature computation on each layer of the GNN. This is called a global attribute computation in the framework of Battaglia et al. (2018). Following the recent GNN literature (Gilmer et al., 2017; Morris et al., 2019; Xu et al., 2019), they refer to this global operation as a readout. Formally, an aggregate-combine-readout GNN (ACR-GNN) extends AC-GNNs by specifying readout functions READ(i), which aggregate the current feature vectors of all the nodes in a graph.<br />
Then, the vector <math>{x_v}^i</math> of each node v in G on each layer i is computed by the following formula:<br />
<br />
[[File:a227-formula-final.png|700px|center|Image: 700 pixels]]<br />
<br />
Intuitively, every layer in an ACR-GNN first computes (i.e., “reads out”) the aggregation over all the nodes in G; then, for every node v, it computes the aggregation over the neighbors of v; and finally, it combines the features of v with the two aggregation vectors.<br />
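<br />
Below is a hedged sketch of one ACR-GNN layer, extending the AC-GNN layer shown earlier with a global readout, taken here as the sum of all node features; the specific linear-plus-ReLU combination is again an illustrative assumption rather than the paper's exact parameterization.<br />
<pre>
import torch

def acr_gnn_layer(x, adj, w_self, w_neigh, w_read, b):
    agg = adj @ x                              # local aggregation over neighbours
    readout = x.sum(dim=0, keepdim=True)       # global readout over all nodes in G
    readout = readout.expand_as(x)             # broadcast the global vector to every node
    return torch.relu(x @ w_self + agg @ w_neigh + readout @ w_read + b)

x = torch.eye(3)                               # triangle graph, one-hot colours
adj = torch.tensor([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
w_self, w_neigh, w_read = torch.randn(3, 4), torch.randn(3, 4), torch.randn(3, 4)
h = acr_gnn_layer(x, adj, w_self, w_neigh, w_read, torch.zeros(4))
</pre>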
<br />
As discussed, AC-GNNs alone cannot capture classifiers that depend on global information about the graph. However, using a single readout plus local aggregations, one can implement such a classifier γ(x) as follows. First, define B as the property “having at least 2 blue neighbors”. Then an ACR-GNN that implements γ(x) can (1) use one local aggregation to record in the feature of every node whether it satisfies B, then (2) use a readout function to count how many nodes satisfying B exist in the whole graph, and (3) use another local aggregation to count how many neighbors of every node satisfy B.<br />
<br />
They then show that just one readout is enough. However, this reduction in the number of readouts comes at the cost of severely complicating the resulting GNN. Formally, an aggregate-combine GNN with final readout (AC-FR-GNN) results from using any number of layers as in the AC-GNN definition, together with a final layer that uses a readout function.<br />
<br />
== Experiments ==<br />
The authors performed experiments with synthetic data to empirically validate their results. They perform two sets of experiments: experiments showing that ACR-GNNs can learn a very simple FOC2 node classifier that AC-GNNs cannot learn, and experiments involving complex FOC2 classifiers that need more intermediate readouts to be learned. Besides testing simple AC-GNNs, they also tested the GIN network proposed by Xu et al. (2019) (they consider the implementation by Fey & Lenssen (2019) and adapt it to classify nodes). Their experiments use synthetic graphs with five initial colors encoded as one-hot features, divided into three sets: the train set with 5k graphs of size up to 50-100 nodes, a test set with 500 graphs of a size similar to the train set, and another test set with 500 graphs of size bigger than in the train set. They tried several configurations for the aggregation, combination, and readout functions, and report the accuracy of the best configuration. In their experiments, accuracy is computed as the total number of nodes correctly classified among all nodes in all the graphs in the dataset. In every case, they run up to 20 epochs with the Adam optimizer. <br />
<br />
[[File:a227_table1.png|600px|center|Image: 600 pixels]]<br />
<br />
[[File:a227_table2.png|560px|center|Image: 600 pixels]]<br />
<br />
For both types of graphs, single-layer ACR-GNNs already showed perfect performance (ACR-1 in Table 1). This was expected given the simplicity of the property being checked. In contrast, AC-GNNs and GINs (shown in Table 1 as AC-L and GIN-L, representing AC-GNNs and GINs with L layers) struggle to fit the data. For the line-shaped graphs, they were not able to fit the training data even with 7 layers. For the random graphs, the performance with 7 layers was considerably better.<br />
<br />
Table 2 above corresponds to the results on Erdős-Rényi (E-R) synthetic data for nodes labeled by the classifier below. ACR-GNN performance with up to 3 layers is reported. For the bigger test set, it was also observed that AC-GNNs and GINs are unable to substantially depart from a trivial baseline of 50%.<br />
<br />
[[File:a227eq6.png|400px|center|Image: 400 pixels]]<br />
<br />
'''Statistics of the datasets used for the above equation is shown below''' <br />
<br />
[[File:Paper13_Statistics_Dataset.png|center]]<br />
<br />
== Final Remarks ==<br />
<br />
The paper's results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice, e.g., Deng et al. (2018) use global-context aware local descriptors to classify objects in 3D point clouds, You et al. (2019) construct node features by computing shortest-path distances to a set of distant anchor nodes, and Haonan et al. (2019) introduced the idea of a “star node” that stores global information of the graph. As mentioned before, their work is close in spirit to that of Xu et al. (2019) and Morris et al. (2019), establishing the correspondence between the WL test and GNNs.<br />
<br />
Regarding the results on the links between AC-GNNs and graded modal logic (Theorem 4.2), the very recent work of Sato et al. (2019) establishes close relationships between GNNs and certain classes of distributed local algorithms. These in turn have been shown to have strong correspondences with modal logics (Hella et al., 2015).<br />
<br />
== Conclusion ==<br />
The authors were successful in establishing their claims with the help of ACR-GNNs. The results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice; for example, Deng et al. (2018) use global-context aware local descriptors to classify objects in 3D point clouds.<br />
The authors would like to study how their results can be applied for extracting logical formulas from GNNs as possible explanations for their computations.<br />
The code for this paper is freely available at [https://github.com/juanpablos/GNN-logic GNN-logic].<br />
<br />
== Critiques==<br />
<br />
The paper is quite successful in characterizing which Boolean node classifiers can be expressed by GNNs. The paper was released in 2019 and has already been cited 22 times. The content is well organized, and the explanations are easy to understand for an average reader. The authors have also discussed future work and possibilities. They could have given more commentary about the performance differences across different classifiers.<br />
<br />
The fact that no actual difference in performance between AC-GNNs and ACR-GNNs was noticed in the only non-synthetic dataset used in the experiment should prompt the author to run experiments with more real-life datasets to verify the results empirically.<br />
<br />
== References ==<br />
[1] Franz Baader and Carsten Lutz. Description logic. In Handbook of modal logic, pp. 757–819. North-Holland, 2007.<br />
<br />
[2] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. PatelSchneider (eds.). The description logic handbook: theory, implementation, and applications. Cambridge University Press, 2003.<br />
<br />
[3] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018. URL http://arxiv.org/abs/1806.01261.<br />
<br />
[4] Jin-Yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.<br />
<br />
[5] Ting Chen, Song Bian, and Yizhou Sun. Are powerful graph neural nets necessary? A dissection on graph classification. CoRR, abs/1905.04579, 2019. URL https://arxiv.org/abs/1905.04579.</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION&diff=48690DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION2020-12-01T17:16:23Z<p>Dmaleki: </p>
<hr />
<div>== Presented by == <br />
Bowen You<br />
<br />
== Introduction == <br />
<br />
Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised and unsupervised learning, and it refers to training an agent to make a series of decisions in a complex, evolving environment. Typically, this is accomplished by 'rewarding' or 'penalizing' the agent based on its behavior over time. Intelligent agents are able to accomplish tasks that may not have been seen in prior experiences. For recent reviews of reinforcement learning, see [3,4]. One way to achieve this is to represent the world based on past experiences. In this paper, the authors propose an agent that learns long-horizon behaviors purely by latent imagination and outperforms previous agents in terms of data efficiency, computation time, and final performance. The method learns a latent world model of the environment via prediction, and then learns an actor and a critic directly on latent state trajectories, which helps it scale to more complex tasks.<br />
<br />
=== Preliminaries ===<br />
<br />
This section aims to define a few key concepts in reinforcement learning. In the typical reinforcement problem, an <b>agent</b> interacts with the <b>environment</b>. The environment is typically defined by a <b>model</b> that may or may not be known. The environment may be characterized by its <b>state</b> <math display="inline"> s \in \mathcal{S}</math>. The agent may choose to take <b>actions</b> <math display="inline"> a \in \mathcal{A}</math> to interact with the environment. Once an action is taken, the environment returns a <b>reward</b> <math display="inline"> r \in \mathcal{R}</math> as feedback.<br />
<br />
The actions an agent decides to take are determined by a <b>policy</b> function <math display="inline"> \pi : \mathcal{S} \to \mathcal{A}</math>. <br />
Additionally, we define functions <math display="inline"> V_{\pi} : \mathcal{S} \to \mathbb{R}</math> and <math display="inline"> Q_{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}</math> to represent the value function and action-value function of a given policy <math display="inline">\pi</math>, respectively. Informally, <math>V_{\pi}</math> tells one how good a state is in terms of the expected return when starting in the state <math>s</math> and then following the policy <math>\pi</math>. Similarly, <math>Q_{\pi}</math> gives the expected return starting from the state <math>s</math>, taking the action <math>a</math>, and subsequently following the policy <math>\pi</math>. <br />
<br />
Thus the goal is to find an optimal policy <math display="inline">\pi_{*}</math> such that <br />
\[<br />
\pi_{*} = \arg\max_{\pi} V_{\pi}(s) = \arg\max_{\pi} Q_{\pi}(s, a)<br />
\]<br />
<br />
=== Feedback Loop ===<br />
<br />
Given this framework, agents interact with the environment sequentially through actions, states, and rewards. Let <math display="inline"> S_t, A_t, R_t</math> denote the state, action, and reward at time <math display="inline"> t = 1, 2, \ldots, T</math>. Each tuple <math display="inline">(S_t, A_t, R_t)</math> is one step of interaction, and the full sequence of such steps up to time <math display="inline">T</math> forms an <b>episode</b>. This can be thought of as a feedback loop or a sequence<br />
\[<br />
S_1, A_1, R_1, S_2, A_2, R_2, \ldots, S_T<br />
\]<br />
<br />
== Motivation ==<br />
<br />
In many problems, the number of interactions an agent can have with the environment is limited, which makes it difficult to learn an accurate representation of the world. The method proposed in this paper aims to solve this problem by "imagining" the states and rewards that future actions would produce. That is, given a state <math display="inline">S_t</math>, the proposed method generates <br />
\[<br />
\hat{A}_t, \hat{R}_t, \hat{S}_{t+1}, \ldots<br />
\]<br />
<br />
By doing this, an agent is able to plan ahead and perceive a representation of the environment without interacting with it. Once an action is actually taken, the agent updates its representation of the world using the real observation. This is particularly useful in applications where experience is not easily obtained. <br />
<br />
== Dreamer == <br />
<br />
The authors call their method Dreamer. At a high level, Dreamer first learns latent dynamics from past experience, then learns actions and state values from imagined trajectories so as to maximize future rewards, and finally predicts the next action and executes it in the environment. This whole process is illustrated below. <br />
<br />
[[File: dreamer_overview.png | 600px | center]]<br />
<br />
<br />
Let's look at Dreamer in detail. It consists of :<br />
* Representation <math display="inline">p_{\theta}(s_t | s_{t-1}, a_{t-1}, o_{t}) </math><br />
* Transition <math display="inline">q_{\theta}(s_t | s_{t-1}, a_{t-1}) </math><br />
* Reward <math display="inline"> q_{\theta}(r_t | s_t)</math><br />
* Action <math display="inline"> q_{\phi}(a_t | s_t)</math><br />
* Value <math display="inline"> v_{\psi}(s_t)</math><br />
<br />
where <math>o_{t}</math> is the observation at time <math>t</math> and <math display="inline"> \theta, \phi, \psi</math> are learned neural network parameters.<br />
<br />
The main three components of agent learning in imagination are dynamics learning, behavior learning, and environment interaction. In the compact latent space of the world model, the behavior is learned by predicting hypothetical trajectories. Throughout the agent's lifetime, Dreamer performs the following operations either in parallel or interleaved as shown in Figure 3 and Algorithm 1:<br />
<br />
* Dynamics Learning: Using past experience data, the agent learns to encode observations and actions into latent states and predicts environment rewards. One way to do this is via representation learning.<br />
* Behavior Learning: In the latent space, the agent predicts state values and actions that maximize future rewards through back-propagation.<br />
* Environment Interaction: The agent encodes the episode to compute the current model state and predict the next action to interact with the environment.<br />
<br />
The proposed algorithm is described below.<br />
<br />
[[File:ashraf98.png|frameless|700px|Dreamer algorithm|center]]<br />
<br />
Notice that three neural networks are trained simultaneously. <br />
The neural networks with parameters <math display="inline"> \theta, \phi, \psi </math> correspond to models of the environment, action, and values respectively. The action model tries to act well in the imagined environment by proposing actions, while the value model estimates the expected rewards that the action model will achieve. Hence, these two models are trained cooperatively: the action model tries to maximize the estimated value while the value model provides that estimate based on the action model's behavior.<br />
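<br />
The following is a highly simplified sketch of behaviour learning by latent imagination. It assumes deterministic toy networks for the transition, reward, action, and value models, a plain sum-of-rewards return instead of the λ-return, and a single imagined rollout; the actual Dreamer agent uses a stochastic recurrent state-space model and more careful value targets, so this is only meant to show how gradients flow back through imagined trajectories.<br />
<pre>
import torch
import torch.nn as nn

latent, act_dim, horizon, batch = 8, 2, 5, 16
transition = nn.Linear(latent + act_dim, latent)                  # stand-in for q(s_t | s_{t-1}, a_{t-1})
reward_fn  = nn.Linear(latent, 1)                                 # stand-in for q(r_t | s_t)
actor      = nn.Sequential(nn.Linear(latent, act_dim), nn.Tanh()) # stand-in for q(a_t | s_t)
critic     = nn.Linear(latent, 1)                                 # stand-in for v(s_t)

actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

s = torch.randn(batch, latent)        # latent start states inferred from replayed episodes
states, rewards = [], []
for _ in range(horizon):              # imagined rollout: no environment interaction at all
    a = actor(s)
    s = transition(torch.cat([s, a], dim=-1))
    states.append(s)
    rewards.append(reward_fn(s))

# crude return estimate: imagined rewards plus the value of the final imagined state
returns = torch.stack(rewards).sum(0) + critic(states[-1])

actor_loss = -returns.mean()          # the actor maximises imagined returns (analytic gradients)
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

critic_loss = (critic(states[-1].detach()) - returns.detach()).pow(2).mean()
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
</pre>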
<br />
== Related Works ==<br />
<br />
Previous works that exploit latent dynamics can be grouped into three categories:<br />
<br />
* Visual Control with latent dynamics by derivative-free policy learning or online planning.<br />
* Augment model-free agents with multi-step predictions.<br />
* Use analytic gradients of Q-values.<br />
<br />
While the latter approaches are often limited to low-dimensional tasks, Dreamer uses analytic gradients to efficiently learn long-horizon behaviors for visual control purely by latent imagination.<br />
<br />
== Results ==<br />
The following figure shows reward versus environment steps. Dreamer outperforms the other baseline algorithms, and it also converges considerably faster. <br />
[[File:dreamer.paper19.png|frameless|500px|Rewards vs environment steps of Dreamer and other baseline algorithms]]<br />
<br />
<br />
The figure below summarizes Dreamer's performance compared to other state-of-the-art reinforcement learning agents for continuous control tasks. Using the same hyperparameters for all tasks, Dreamer exceeds previous model-based and model-free agents in terms of data efficiency, computation time, and final performance, and overall it achieves the most consistent performance among them. Additionally, while other agents heavily rely on prior experience, Dreamer is able to learn behaviors with minimal interaction with the environment.<br />
<br />
[[File:scores.png|frameless|500px|Comparison of RL-agents against several continuous control tasks]]<br />
<br />
== Conclusion ==<br />
<br />
This paper presented a new algorithm for training reinforcement learning agents with minimal interactions with the environment. The algorithm outperforms many previous algorithms in terms of computation time and overall performance. This has many practical applications as many agents rely on prior experience which may be hard to obtain in the real-world. Although it may be an extreme example, consider a reinforcement learning agent who learns how to perform rare surgeries without enough data samples. This paper shows that it is possible to train agents without requiring many prior interactions with the environment. Also, as a future work on representation learning, the ability to scale latent imagination to higher visual complexity environments can be investigated.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at https://github.com/google-research/dreamer. <br />
<br />
== Critique ==<br />
This paper presents an approach that involves learning a latent dynamics model to learn 20 visual control tasks.<br />
<br />
Appendix A mentions that "three dense layers of size 300 with ELU activations" and "30-dimensional diagonal Gaussians" have been used for the distributions in latent space. The paper would have benefited from explaining how the authors arrived at this architecture, in other words, how the choice of latent vector affects the performance of the agent.<br />
<br />
Another fact about Dreamer is that it learns long-horizon behaviors purely by latent imagination, unlike previous approaches. It is also applicable to tasks with discrete actions and early episode termination.<br />
<br />
<br />
Learning a policy from visual inputs is quite an interesting research direction in RL. This paper steps in this direction by improving existing model-based methods (world models and PlaNet) using an actor-critic approach, but in my view the method is an incremental contribution, as back-propagating gradients through values and dynamics has been studied in previous works.<br />
<br />
== References ==<br />
<br />
[1] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020.<br />
<br />
[2] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.<br />
<br />
[3] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.<br />
<br />
[4] Nian, R., Liu, J., & Huang, B. (2020). A review On reinforcement learning: Introduction and applications in industrial process control. Computers and Chemical Engineering, 139, 106886.</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes&diff=48688Functional regularisation for continual learning with gaussian processes2020-12-01T15:46:42Z<p>Dmaleki: </p>
<hr />
<div>== Presented by == <br />
Meixi Chen<br />
<br />
== Introduction ==<br />
<br />
Continual Learning (CL) refers to the problem where different tasks are fed to a model sequentially, such as training a natural language processing model on different languages over time. A major challenge in CL is that a model forgets how to solve earlier tasks. This paper proposes a new framework to regularize CL so that the model does not forget previously learned tasks. The method, referred to as functional regularization for Continual Learning, leverages Gaussian processes to construct an approximate posterior belief over the underlying task-specific function. The posterior belief is then used as a regularizer during optimization to prevent the model from deviating completely from the earlier tasks. The estimation of the posterior functions is carried out under the framework of approximate Bayesian inference.<br />
<br />
== Previous Work ==<br />
<br />
There are two types of methods that have been widely used in Continual Learning.<br />
<br />
===Replay/Rehearsal Methods===<br />
<br />
This type of method stores the data or its compressed form from earlier tasks. The stored data is replayed when learning a new task to mitigate forgetting. It can be used for constraining the optimization of new tasks or joint training of both previous and current tasks. However, it has two disadvantages: 1) Deciding which data to store often remains heuristic; 2) It requires a large quantity of stored data to achieve good performance.<br />
<br />
===Regularization-based Methods===<br />
<br />
These methods leverage sequential Bayesian inference by putting a prior distribution over the model parameters in the hope to regularize the learning of new tasks. Elastic Weight Consolidation (EWC) and Variational Continual Learning (VCL) are two important methods, both of which make model parameters adaptive to new tasks while regularizing weights by prior knowledge from the earlier tasks. Nonetheless, this might still result in an increased forgetting of earlier tasks with long sequences of tasks.<br />
<br />
== Comparison between the Proposed Method and Previous Methods ==<br />
<br />
===Comparison to replay/rehearsal methods===<br />
<br />
'''Similarity''': It also stores data from earlier tasks.<br />
<br />
'''Difference''': Instead of storing a subset of data, it stores a set of ''inducing points'', which can be optimized using criteria from GP literature [2] [3] [4].<br />
<br />
===Comparison to regularization-based methods===<br />
<br />
'''Similarity''': It is also based on approximate Bayesian inference by using a prior distribution that regularizes the model updates.<br />
<br />
'''Difference''': It constrains the neural network on the space of functions rather than weights by making use of ''Gaussian processes'' (GP).<br />
<br />
== Recap of the Gaussian Process ==<br />
<br />
'''Definition''': A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution [1].<br />
<br />
The Gaussian process is a non-parametric approach, as it can be viewed as an infinite-dimensional generalization of multivariate normal distributions. Informally, it can be thought of as a distribution over functions, which is why we use a GP to perform optimization in function space. A Gaussian process over a prediction function <math>f(\boldsymbol{x})</math> can be completely specified by its mean function and covariance function (or kernel function), <br />
\[\text{Gaussian process: } f(\boldsymbol{x}) \sim \mathcal{GP}(m(\boldsymbol{x}),K(\boldsymbol{x},\boldsymbol{x}'))\]<br />
Note that in practice the mean function is typically taken to be 0 because we can always write <math>f(\boldsymbol{x})=m(\boldsymbol{x}) + g(\boldsymbol{x})</math> where <math>g(\boldsymbol{x})</math> follows a GP with 0 mean. Hence, the GP is characterized by its kernel function.<br />
<br />
In fact, we can connect a GP to a multivariate normal (MVN) distribution with 0 mean, which is given by<br />
\[\text{Multivariate normal distribution: } \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}).\]<br />
When we only observe finitely many <math>\boldsymbol{x}</math>, the function's value at these input points is a multivariate normal distribution with covariance matrix parametrized by the kernel function.<br />
<br />
Note: Throughout this summary, <math>\mathcal{GP}</math> refers the the distribution of functions, and <math>\mathcal{N}</math> refers to the distribution of finite random variables.<br />
<br />
''' A One-dimensional Example of the Gaussian Process '''<br />
<br />
In the figure below, the red dashed line represents the underlying true function <math>f(x)</math> and the red dots are the observation taken from this function. The blue solid line indicates the predicted function <math>\hat{f}(x)</math> given the observations, and the blue shaded area corresponds to the uncertainty of the prediction.<br />
<br />
[[File:FRCL-GP-example.jpg|500px|]]<br />
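<br />
As a hedged numerical illustration of the figure above, the snippet below performs standard GP regression with a squared-exponential kernel on a handful of noisy 1-D observations; the kernel, noise level, and data are assumptions made purely for illustration.<br />
<pre>
import numpy as np

def kernel(a, b, lengthscale=1.0, variance=1.0):
    sq = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

rng = np.random.default_rng(0)
x_train = rng.uniform(-4, 4, 6)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=6)   # noisy draws from the true function
x_test = np.linspace(-5, 5, 200)

noise = 0.1 ** 2
K    = kernel(x_train, x_train) + noise * np.eye(len(x_train))
K_s  = kernel(x_train, x_test)
K_ss = kernel(x_test, x_test)

K_inv = np.linalg.inv(K)
mean = K_s.T @ K_inv @ y_train                          # posterior mean (the solid line)
cov  = K_ss - K_s.T @ K_inv @ K_s                       # posterior covariance
std  = np.sqrt(np.clip(np.diag(cov), 0.0, None))        # uncertainty band (the shaded area)
</pre>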
<br />
== Methods ==<br />
<br />
Consider a deep neural network in which the final hidden layer provides the feature vector <math>\phi(x;\theta)\in \mathbb{R}^K</math>, where <math>x</math> is the input data and <math>\theta</math> are the task-shared model parameters. Importantly, let's assume the task boundaries are known. That is, we know when the input data is switched to a new task. Taking the NLP model as an example, this is equivalent to assuming we know whether each batch of data belongs to the English, French, or German dataset. This assumption is important because it allows us to know when to update the task-shared parameters <math>\theta</math>. The authors also discussed how to detect task boundaries when they are not given, which will be presented later in this summary.<br />
<br />
For each specific task <math>i</math>, an output layer is constructed as <math>f_i(x;w_i) = w_i^T\phi(x;\theta)</math>, where <math>w_i</math> is the task-specific weight. By assuming that the weight <math>w_i</math> follows a normal distribution <math>w_i\sim \mathcal{N}(0, \sigma_w^2I)</math>, we obtain a distribution over functions:<br />
\[f_i(x) \sim \mathcal{GP}(0, k(x,x')), \]<br />
where <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math>. We can express our posterior belief over <math>f_i(x)</math> instead of <math>w_i</math>. Namely, we are interested in estimating<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \sim p(\boldsymbol{f}_i|\boldsymbol{y}_i, X_i),\]<br />
where <math>X_i = \{x_{i,j}\}_{j=1}^{N_i}</math> are input vectors and <math>\boldsymbol{y}_i = \{y_{i,j}\}_{j=1}^{N_i}</math> are output targets, so that each <math> y_{i,j} </math> is assigned to the input <math>x_{i,j} \in \mathbb{R}^D</math>. However, in practice the following approximation is used:<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \overset{approx.}{\sim} \mathcal{N}(\boldsymbol{f}_i|\mu_i, \Sigma_i),\]<br />
Instead of having fixed model weight <math>w_i</math>, we now have a distribution for it, which depends on the input data. Then we can summarize information acquired from a task by the estimated distribution of the weights, or equivalently, the distribution of the output functions that depend on the weights. However, we are facing the computational challenge of storing <math>\mathcal{O}(N_i^2)</math> parameters and keeping in memory the full set of input vector <math>X_i</math>. To see this, note that the <math>\Sigma_i</math> is a <math>N_i \times N_i</math> matrix. Hence, the authors tackle this problem by using the Sparse Gaussian process with inducing points, which is introduced as follows.<br />
<br />
'''Inducing Points''': <math>Z_i = \{z_{i,j}\}_{j=1}^{M_i}</math>, which can be a subset of <math>X_i</math> (the <math>i</math>-th training inputs) or points lying between the training inputs.<br />
<br />
'''Auxiliary function''': <math>\boldsymbol{u}_i</math>, where <math>u_{i,j} = f(z_{i,j})</math>. <br />
<br />
We typically choose the number of inducing points to be a lot smaller than the number of training data. The idea behind the inducing point method is to replace <math>\boldsymbol{f}_i</math> by the auxiliary function <math>\boldsymbol{u}_i</math> evaluated at the inducing inputs <math>Z_i</math>. Intuitively, we are also assuming the inducing inputs <math>Z_i</math> contain enough information to make inference about the "true" <math>\boldsymbol{f}_i</math>, so we can replace <math>X_i</math> by <math>Z_i</math>. <br />
<br />
Now we can introduce how to learn the first task when no previous knowledge has been acquired.<br />
<br />
=== Learning the First Task ===<br />
<br />
In learning the first task, the goal is to generate the first posterior belief given this task: <math>p(\boldsymbol{u}_1|\text{Data})</math>. We learn this distribution by approximating it by a variational distribution: <math>q(\boldsymbol{u}_1)</math>. That is, <math>p(\boldsymbol{u}_1|\text{Data}) \approx q(\boldsymbol{u}_1)</math>. We can parametrize <math>q(\boldsymbol{u}_1)</math> as <math>\mathcal{N}(\boldsymbol{u}_1 | \mu_{u_1}, L_{u_1}L_{u_1}^T)</math>, where <math>L_{u_1}</math> is the lower triangular Cholesky factor. I.e., <math>\Sigma_{u_1}=L_{u_1}L_{u_1}^T</math>. Next, we introduce how to estimate <math>q(\boldsymbol{u}_1)</math>, or equivalently, <math>\mu_{u_1}</math> and <math>L_{u_1}</math>, using variational inference.<br />
<br />
Given the first task with data <math>(X_1, \boldsymbol{y}_1)</math>, we can use a variational distribution <math>q(\boldsymbol{f}_1, \boldsymbol{u}_1)</math> that approximates the exact posterior distribution <math>p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1)</math>, where<br />
\[q(\boldsymbol{f}_1, \boldsymbol{u}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1)\]<br />
\[p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_\theta(\boldsymbol{u}_1|\boldsymbol{y}_1).\]<br />
Note that we use <math>p_\theta(\cdot)</math> to denote the Gaussian distribution form with kernel parametrized by a common <math>\theta</math>.<br />
<br />
Hence, we can jointly learn <math>q(\boldsymbol{u}_1)</math> and <math>\theta</math> by minimizing the KL divergence <br />
\[\text{KL}(p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_{\theta}(\boldsymbol{u}_1|\boldsymbol{y}_1)),\]<br />
which is equivalent to maximizing the Evidence Lower Bound (ELBO)<br />
\[\mathcal{F}({\theta}, q(\boldsymbol{u}_1)) = \sum_{j=1}^{N_1} \mathbb{E}_{q(f_{1,j})}[\log p(y_{1,j}|f_{1,j})]-\text{KL}(q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{u}_1)).\]<br />
<br />
=== Learning the Subsequent Tasks ===<br />
<br />
After learning the first task, we only keep the inducing points <math>Z_1</math> and the parameters of <math>q(\boldsymbol{u}_1)</math>, both of which act as a task summary of the first task. Note that <math>\theta</math> also has been updated based on the first task. In learning the <math>k</math>-th task, we can use the posterior belief <math>q(\boldsymbol{u}_1), q(\boldsymbol{u}_2), \ldots, q(\boldsymbol{u}_{k-1})</math> obtained from the preceding tasks to regularize the learning, and produce a new task summary <math>(Z_k, q(\boldsymbol{u}_k))</math>. Similar to the first task, now the objective function to be maximized is<br />
\[\mathcal{F}(\theta, q(\boldsymbol{u}_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_{k,j})}[\log p(y_{k,j}|f_{k,j})]-<br />
\text{KL}(q(\boldsymbol{u}_k) \ || \ p_{\theta}(\boldsymbol{u}_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i))}_{\text{regularization from previous tasks}}\]<br />
<br />
As a result, we regularize the learning of a new task by the sum of KL divergence terms <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math>, where each <math>q(\boldsymbol{u}_i)</math> encodes the knowledge about an earlier task <math>i < k</math>. To make the optimization computationally efficient, we can sub-sample the KL terms in the sum and perform stochastic approximation over the regularization term.<br />
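<br />
A small sketch of this regulariser is given below: each stored task summary contributes a closed-form KL divergence between the Gaussian <math>q(\boldsymbol{u}_i)=\mathcal{N}(\mu_i, L_iL_i^T)</math> and the zero-mean GP prior <math>\mathcal{N}(0, K_{Z_i})</math> built from the current shared features, and the sum over past tasks can be sub-sampled. The shapes, kernel, and jitter are illustrative assumptions, not the paper's exact implementation.<br />
<pre>
import torch

def gaussian_kl(mu_q, L_q, K_prior):
    """KL( N(mu_q, L_q L_q^T) || N(0, K_prior) ) for M inducing points."""
    M = mu_q.shape[0]
    Sigma_q = L_q @ L_q.T
    K_inv = torch.inverse(K_prior)
    return 0.5 * (torch.trace(K_inv @ Sigma_q) + mu_q @ K_inv @ mu_q - M
                  + torch.logdet(K_prior) - torch.logdet(Sigma_q))

def prior_cov(phi_z, sigma_w=1.0, jitter=1e-4):
    """K(Z, Z) = sigma_w^2 * Phi_Z Phi_Z^T built from the current shared features."""
    return sigma_w ** 2 * phi_z @ phi_z.T + jitter * torch.eye(len(phi_z))

# pretend task summaries: (features at Z_i, mu_i, L_i) for two earlier tasks
task_summaries = [(torch.randn(5, 3), torch.randn(5), torch.tril(torch.eye(5)))
                  for _ in range(2)]

# full regulariser: sum of KL terms over all previous tasks
reg = sum(gaussian_kl(mu, L, prior_cov(phi_z)) for phi_z, mu, L in task_summaries)

# stochastic version: sub-sample one past task and rescale for an unbiased estimate
idx = torch.randint(len(task_summaries), (1,)).item()
phi_z, mu, L = task_summaries[idx]
reg_stochastic = len(task_summaries) * gaussian_kl(mu, L, prior_cov(phi_z))
</pre>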
<br />
=== Alternative Inference for the Current Task ===<br />
<br />
Given this framework of sparse GP inference, the authors proposed a further improvement to obtain more accurate estimates of the posterior belief <math>q(\boldsymbol{u}_k)</math>: performing inference over the current task in weight space. Due to the trade-off between accuracy and scalability imposed by the number of inducing points, we can use a full Gaussian variational approximation <br />
\[q(w_k) = \mathcal{N}(w_k|\mu_{w_k}, \Sigma_{w_k})\]<br />
by letting <math>f_k(x; w_k) = w_k^T \phi(x; \theta)</math>, <math>w_k \sim \mathcal{N}(0, \sigma_w^2 I)</math>. <br />
The objective becomes<br />
\[\mathcal{F}(\theta, q(w_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_{k,j})}[\log p(y_{k,j}|w_k^T \phi(x_{k,j}; \theta))]-<br />
\text{KL}(q(w_k) \ || \ p(w_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i))}_{\text{regularization from previous tasks}}\]<br />
<br />
After learning <math>\mu_{w_k}</math> and <math>\Sigma_{w_k}</math>, we can also compute the posterior distribution over the function values <math>\boldsymbol{u}_k</math> according to <math>q(\boldsymbol{u}_k) = \mathcal{N}(\boldsymbol{u}_k|\mu_{u_k}, L_{u_k}L_{u_k}^T)</math>, where <math>\mu_{u_k} = \Phi_{Z_k}\mu_{w_k}</math>, <math>L_{u_k}=\Phi_{Z_k}L_{w_k} </math>, and <math>\Phi_{Z_k}</math> stores as rows the feature vectors evaluated at <math>Z_k</math>.<br />
<br />
The figure below is a depiction of the proposed method.<br />
<br />
[[File:FRCL-depiction-approach.jpg|1000px]]<br />
<br />
=== Selection of the Inducing Points ===<br />
<br />
In practice, a simple but effective way to select inducing points is to select a random set <math>Z_k</math> of the training inputs <math>X_k</math>. In this paper, the authors proposed a structured way to select them. The proposed method is an unsupervised criterion that only depends on the training inputs, which quantifies how well the full kernel matrix <math>K_{X_k}</math> can be constructed from the inducing inputs. This is done by minimizing the trace of the covariance matrix of the prior GP conditional <math>p(\boldsymbol{f}_k|\boldsymbol{u}_k)</math>:<br />
\[\mathcal{T}(Z_k)=\text{tr}(K_{X_k} - K_{X_kZ_k}K_{Z_k}^{-1}K_{Z_kX_k}),\]<br />
where <math>K_{X_k} = K(X_k, X_k), K_{X_kZ_k} = K(X_k, Z_k), K_{Z_k} = K(Z_k, Z_k)</math>, and <math>K(\cdot, \cdot)</math> is the kernel function parametrized by <math>\theta</math>. This method promotes finding inducing points <math>Z_k</math> that are spread evenly in the input space. As an example, see the following figure, where the final selected inducing points are spread out over different clusters of the data. The round dots represent the data points and each color corresponds to a different label.<br />
<br />
[[File:inducing-points.jpg|500px]]<br />
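<br />
The sketch below illustrates the trace criterion with a simple greedy search over candidate inducing points; the greedy strategy and the RBF kernel are illustrative assumptions, since the paper specifies the criterion to be minimized rather than a particular search procedure.<br />
<pre>
import numpy as np

def rbf(a, b, ls=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def trace_criterion(X, Z, jitter=1e-6):
    """tr(K_X - K_XZ K_Z^{-1} K_ZX): how well Z reconstructs the full kernel matrix."""
    K_X, K_XZ = rbf(X, X), rbf(X, Z)
    K_Z = rbf(Z, Z) + jitter * np.eye(len(Z))
    return np.trace(K_X - K_XZ @ np.linalg.inv(K_Z) @ K_XZ.T)

def greedy_select(X, num_inducing):
    chosen, remaining = [], list(range(len(X)))
    for _ in range(num_inducing):
        scores = [trace_criterion(X, X[chosen + [j]]) for j in remaining]
        chosen.append(remaining.pop(int(np.argmin(scores))))
    return X[chosen]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (-2, 0, 2)])  # three clusters
Z = greedy_select(X, num_inducing=6)   # selected points end up spread across the clusters
</pre>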
<br />
=== Prediction ===<br />
<br />
Given a test data point <math>x_{i,*}</math>, we can obtain the predictive density function of its output <math>y_{i,*}</math> given by<br />
\begin{align*}<br />
p(y_{i,*}) &= \int p(y_{i,*}|f_{i,*}) p_\theta(f_{i,*}|\boldsymbol{u}_i)q(\boldsymbol{u}_i) d\boldsymbol{u}_i df_{i,*}\\<br />
&= \int p(y_{i,*}|f_{i,*}) q_\theta(f_{i,*}) df_{i,*},<br />
\end{align*}<br />
where <math>q_\theta(f_{i,*})=\mathcal{N}(f_{i,*}| \mu_{i,*}, \sigma_{i,*}^2)</math> with known mean and variance<br />
\begin{align*}<br />
\mu_{i,*} &= \mu_{u_i}^T K_{Z_i}^{-1} k_{Z_i x_{i,*}}\\<br />
\sigma_{i,*}^2 &= k(x_{i,*}, x_{i,*}) + k_{Z_i x_{i,*}}^T K_{Z_i}^{-1}[L_{u_i}L_{u_i}^T - K_{Z_i}] K_{Z_i}^{-1} k_{Z_i x_{i,*}}<br />
\end{align*}<br />
Note that all the terms in <math>\mu_{i,*}</math> and <math>\sigma_{i,*}^2</math> are either already estimated or depend on some estimated parameters.<br />
<br />
It is important to emphasize that the mean <math>\mu_{i,*}</math> can be further rewritten as <math>\mu_{u_i}^TK_{Z_i}^{-1}\Phi_{Z_i}\phi(x_{i,*};\theta)</math>, which, notably, depends on <math>\theta</math>. This means that the expectation of <math>f_{i,*}</math> changes over time as more tasks are learned, so the overall prediction will not be out of date. In comparison, if we use a distribution of weights <math>w_i</math>, the mean of the distribution will remain unchanged over time, thus resulting in obsolete prediction.<br />
<br />
== Detecting Task Boundaries ==<br />
<br />
In the previous discussion, we have assumed the task boundaries are known, but this assumption is often unrealistic in practical settings. Therefore, the authors introduced a way to detect task boundaries using the GP predictive uncertainty. This is done by measuring the distance between the GP posterior density of the data from a new task and the prior GP density using the symmetric KL divergence. We denote this score by <math>\ell_i</math>, which can be interpreted as a degree of surprise about <math>x_i</math>: the smaller <math>\ell_i</math> is, the more surprising <math>x_i</math> is. Before making any updates to the parameters, we can perform a statistical test between the values <math>\{\ell_i\}_{i=1}^b</math> for the current batch and those from the previous batch <math>\{\ell_i^{old}\}_{i=1}^b</math>. A natural choice is Welch's t-test, which is commonly used to compare two groups of data with unequal variance.<br />
<br />
The figure below illustrates the intuition behind this method. With red dots indicating a new task, we can see the GP posterior (green line) reverts back to the prior (purple line) when it encounters the new task. Hence, this explains why a small <math>\ell_i</math> corresponds to a task switch.<br />
<br />
[[File:detecting-boundaries.jpg|700px]]<br />
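<br />
A hedged sketch of the detection step follows: the per-point surprise scores of the current batch are compared with those of the previous batch using Welch's t-test, and a task switch is flagged when the difference is significant. The scores here are synthetic stand-ins for the symmetric KL values <math>\ell_i</math>, and the significance threshold is an assumption.<br />
<pre>
import numpy as np
from scipy import stats

def detect_boundary(scores_old, scores_new, alpha=0.01):
    """Welch's t-test between the previous and the current batch of surprise scores."""
    t_stat, p_value = stats.ttest_ind(scores_new, scores_old, equal_var=False)
    return p_value < alpha, t_stat

rng = np.random.default_rng(0)
same_task = rng.normal(5.0, 1.0, size=64)     # surprise scores while staying on one task
new_task  = rng.normal(1.5, 1.0, size=64)     # scores drop sharply when a new task arrives
print(detect_boundary(same_task, rng.normal(5.0, 1.0, size=64)))  # typically no switch flagged
print(detect_boundary(same_task, new_task))                       # switch flagged
</pre>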
<br />
== Algorithm ==<br />
<br />
[[File:FRCL-algorithm.jpg|700px]]<br />
<br />
== Experiments ==<br />
<br />
The authors aimed to answer three questions:<br />
<br />
# How does FRCL compare to state-of-the-art algorithms for Continual Learning?<br />
# How does the criterion for inducing point selection affect accuracy?<br />
# When ground truth task boundaries are not given, does the detection method mentioned above succeed in detecting task changes?<br />
<br />
=== Comparison to state-of-the-art algorithms ===<br />
<br />
The proposed method was applied to two MNIST-variation datasets (in Table 1) and the more challenging Omniglot benchmark (in Table 2). They compared the method with randomly selected inducing points, denoted by FRCL(RANDOM), and the method with inducing points optimized using trace criterion, denoted by FRCL(TRACE). The baseline method corresponds to a simple replay-buffer method described in the appendix of the paper. Both tables show that the proposed method gives strong results, setting a new state-of-the-art result on both Permuted-MNIST and Omniglot.<br />
<br />
[[File:FRCL-table1.jpg|700px]]<br />
[[File:FRCL-table2.jpg|750px]]<br />
<br />
=== Comparison of different criteria for inducing points selection ===<br />
<br />
As can be seen from the figure below, the purple box corresponding to FRCL(TRACE) is consistently higher than the others, and in particular, this difference is larger when the number of inducing points is smaller. Hence, a structured selection criterion becomes more and more important when the number of inducing points reduces.<br />
<br />
[[File:FRCL-compare-inducing-points.jpg|700px]]<br />
<br />
=== Efficacy in detecting task boundaries ===<br />
<br />
From the following figure, we can observe that both the mean symmetric KL divergence and the t-test statistic always drop when a new task is introduced. Hence, the proposed method for detecting task boundaries indeed works.<br />
<br />
[[File:FRCL-test-boundary.jpg|700px]]<br />
<br />
== Conclusions ==<br />
<br />
The proposed method unifies both the regularization-based method and the replay/rehearsal method in Continual Learning. It was able to overcome the disadvantages of both methods. Moreover, the Bayesian framework allows a probabilistic interpretation of deep neural networks. From the experiments we can make the following conclusions:<br />
* The proposed method sets new state-of-the-art results on Permuted-MNIST and Omniglot, and is comparable to the existing results on Split-MNIST.<br />
* A structured criterion for selecting inducing points becomes increasingly important with a decreasing number of inducing points.<br />
* The method is able to detect task boundary changes when they are not given.<br />
<br />
Future work can include enforcing a fixed memory buffer where the summary of all previously seen tasks is compressed into one summary. It would also be interesting to investigate the application of the proposed method to other domains such as reinforcement learning.<br />
<br />
== Critiques ==<br />
This paper presents a new way of remembering previous tasks by penalizing the KL divergence between the variational distributions <math>q(\boldsymbol{u}_i)</math> of earlier tasks and the current prior <math>p_\theta(\boldsymbol{u}_i)</math>. The ideas in the paper are interesting and the experiments support the effectiveness of this approach. After reading the summary, some points came to my mind, as follows:<br />
<br />
The main problem with Gaussian processes is their test-time computational load: a Gaussian process needs the data matrix and the kernel for each prediction. This is natural, since a Gaussian process is non-parametric and has no source of knowledge other than the data, but it comes with computational and memory costs that make GPs difficult to employ in practice. In this paper, the authors propose to use a subset of the training data, namely "inducing points", to mitigate these challenges. They propose to choose inducing points either at random or based on an optimization scheme in which the inducing points should approximate the kernel. Although the authors formulate the problem of inducing points within their own setting, this is not a new approach in the field; it was previously known as the problem of finding exemplars. In fact, their formulation is very similar to the ideas in the following paper:<br />
<br />
Elhamifar, Ehsan, Guillermo Sapiro, and Rene Vidal. '''Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery.''' Advances in Neural Information Processing Systems. 2012.<br />
<br />
More precisely, the main difference is that the current paper uses the kernel matrix, whereas the mentioned paper uses pairwise dissimilarities, to find exemplars (inducing points).<br />
<br />
Moreover, one unanswered question is how to determine the number of exemplars, as they play an important role in this algorithm.<br />
<br />
Besides, one practical point concerns replacing the covariance matrix with its Cholesky decomposition. In practice covariance matrices are in general only positive semi-definite, while to the best of my knowledge the Cholesky decomposition requires positive definite matrices. Considering this, I am not sure what happens if the Cholesky decomposition is applied directly to such a covariance matrix.<br />
<br />
Finally, the number of regularization terms <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math> grows linearly with the number of tasks, and I am not sure how well this algorithm works when the number of tasks increases. Clearly, apart from the computational cost, having many regularization terms can make optimization more difficult.<br />
<br />
The provided experiments are interesting and reasonably comprehensive, and they do a good job of highlighting different facets of the paper, but it would be better if these two additional results could be provided as well: (1) How well-calibrated are FRCL-based classifiers? (2) How impactful is the hybrid representation for test-time performance?<br />
<br />
== Source Code ==<br />
<br />
https://github.com/AndreevP/FRCL<br />
<br />
== References ==<br />
<br />
[1] Rasmussen, Carl Edward and Williams, Christopher K. I., 2006, Gaussian Processes for Machine Learning, The MIT Press.<br />
<br />
[2] Quinonero-Candela, Joaquin and Rasmussen, Carl Edward, 2005, A Unifying View of Sparse Approximate Gaussian Process Regression, Journal of Machine Learning Research, Volume 6, P1939-1959.<br />
<br />
[3] Snelson, Edward and Ghahramani, Zoubin, 2006, Sparse Gaussian Processes using Pseudo-inputs, Advances in Neural Information Processing Systems 18, P1257-1264.<br />
<br />
[4] Michalis K. Titsias, Variational Learning of Inducing Variables in Sparse Gaussian Processes, 2009, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Volume 5, P567-574. <br />
<br />
[5] Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, Yee Whye Teh, 2020, Functional Regularisation for Continual Learning with Gaussian Processes, ArXiv abs/1901.11356.</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations&diff=48642ALBERT: A Lite BERT for Self-supervised Learning of Language Representations2020-12-01T05:09:40Z<p>Dmaleki: </p>
<hr />
<div>== Presented by == <br />
Maziar Dadbin<br />
<br />
==Introduction==<br />
In this paper, the authors have made some changes to the BERT model, and the result is ALBERT, a model that outperforms BERT on the GLUE, SQuAD, and RACE benchmarks. The important point is that ALBERT has fewer parameters than BERT-large, yet it still produces better results. The changes made to the BERT model are factorized embedding parameterization and cross-layer parameter sharing, which are two methods of parameter reduction. The authors also introduced a new loss function to replace one of the loss functions used in BERT (namely NSP). The last change is removing dropout from the model.<br />
<br />
== Motivation == <br />
In natural language representation learning, larger models often yield improved performance. However, at some point GPU/TPU memory and training time constraints limit our ability to increase the model size any further. There exist some attempts to reduce memory consumption, but at the cost of speed. For example, Chen et al. (2016)[1] reduce memory requirements with a gradient checkpointing technique at the cost of an extra forward pass. Moreover, Gomez et al. (2017)[2] reconstruct each layer's activations from the next layer, eliminating the need to store these activations and freeing up memory. In addition, Raffel et al. (2019)[3] leverage model parallelization while training a massive model. The authors of this paper claim that their parameter reduction techniques reduce memory consumption and increase training speed.<br />
<br />
==Model details==<br />
The fundamental structure of ALBERT is the same as BERT's, i.e., it uses a transformer encoder with GELU nonlinearities. The authors set the feed-forward/filter size to 4H and the number of attention heads to H/64 (where H is the size of the hidden layer). Next, we explain the changes that have been applied to BERT.<br />
<br />
<br />
===Factorized embedding parameterization===<br />
In BERT (as well as subsequent models like XLNet and RoBERTa) we have <math display="inline">\\E</math>=<math display="inline">\\H</math> i.e. the size of the vocabulary embedding (<math display="inline">\\E</math>) and the size of the hidden layer (<math display="inline">\\H</math>) are tied together. This is not an efficient choice because we may need to have a large hidden layer but not a large vocabulary embedding layer. This is actually the case in many applications because the vocabulary embedding ‘<math display="inline">\\E</math>’ is meant to learn context-independent representations while the hidden-layer embedding ‘<math display="inline">\\H</math>’ is meant to learn context-dependent representation which usually is harder. However, if we increase <math display="inline">\\H</math> and <math display="inline">\\E</math> together, it will result in a huge increase in the number of parameters because the size of the vocabulary embedding matrix is <math display="inline">\\V \cdot E</math> where <math display="inline">\\V</math> is the size of the vocabulary and is usually quite large. For example, <math display="inline">\\V</math> equals 30000 in both BERT and ALBERT. <br />
The authors proposed the following solution to the problem:<br />
Do not project the one-hot vectors directly into the hidden space; instead, first project them into a lower-dimensional space of size <math display="inline">E</math> and then project that into the hidden layer. This reduces the number of embedding parameters from <math display="inline">O(V \cdot H)</math> to <math display="inline">O(V \cdot E + E \cdot H)</math>, which is significant when <math display="inline">H</math> is much larger than <math display="inline">E</math>.<br />
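To make the parameter saving concrete, the following is a minimal PyTorch-style sketch of the factorization (illustrative only: the sizes and module names are assumptions, not the authors' code).<br />
<pre>
import torch.nn as nn

V, E, H = 30000, 128, 4096  # illustrative sizes, with E much smaller than H

# BERT-style tied embedding: a single V x H lookup table.
tied_embedding = nn.Embedding(V, H)       # 30000 * 4096 ~ 123M parameters

# ALBERT-style factorization: V x E lookup followed by an E x H projection.
factorized_embedding = nn.Sequential(
    nn.Embedding(V, E),                   # 30000 * 128 ~ 3.8M parameters
    nn.Linear(E, H),                      # 128 * 4096  ~ 0.5M parameters
)

def count_params(module):
    return sum(p.numel() for p in module.parameters())

print(count_params(tied_embedding))       # O(V*H)
print(count_params(factorized_embedding)) # O(V*E + E*H), much smaller
</pre>
With these illustrative sizes, the factorized embedding needs roughly 4.4M parameters instead of about 123M for the tied embedding.<br />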
<br />
===Cross-layer parameter sharing===<br />
Another method the authors used for reducing the number of parameters is to share the parameters across layers. There are different strategies for parameter sharing. For example, one may only share feed-forward network parameters or only share attention parameters. However, the default choice for ALBERT is to simply share all parameters across layers.<br />
The following table shows the effect of different parameter-sharing strategies in two settings for the vocabulary embedding size. In both cases, sharing all the parameters has a negative effect on accuracy, and most of this effect comes from sharing the FFN parameters rather than the attention parameters. Even so, the authors decided to share all the parameters across layers, resulting in a much smaller number of parameters, which in turn enables them to use larger hidden layers; this is how they compensate for what is lost through parameter sharing. <br />
<br />
[[File:sharing.png | center |800px]]<br />
<br />
<br />
'''Why does cross-layer parameter sharing work?'''<br />
From the experiment results, we can see that cross-layer parameter sharing dramatically reduces the model size without hurting the accuracy too much. While it is obvious that sharing parameters can reduce the model size, it might be worth thinking about why parameters can be shared across BERT layers. Two of the authors briefly explained the reason in a blog. They noticed that the network often learned to perform similar operations at various layers (Soricut, Lan, 2019). Previous research also showed that attention heads in BERT behave similarly (Clark et al., 2019). These observations made it possible to use the same weights at different layers.<br />
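As a rough illustration of all-parameter sharing (a sketch, not the authors' implementation; the layer type and sizes are assumptions), a single encoder layer can simply be applied repeatedly instead of stacking distinct layers:<br />
<pre>
import torch
import torch.nn as nn

H, num_layers = 768, 12  # illustrative sizes

# Unshared (BERT-style): num_layers independent encoder layers.
unshared = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=H, nhead=H // 64) for _ in range(num_layers)]
)

# Shared (ALBERT-style): a single layer whose weights are reused at every depth.
shared_layer = nn.TransformerEncoderLayer(d_model=H, nhead=H // 64)

def forward_shared(x):
    for _ in range(num_layers):
        x = shared_layer(x)  # the same parameters are applied at each depth
    return x

def count_params(m):
    return sum(p.numel() for p in m.parameters())

print(count_params(unshared), count_params(shared_layer))  # ~12x fewer when shared

x = torch.randn(16, 2, H)   # (sequence, batch, hidden) dummy input
out = forward_shared(x)
</pre>
The shared version stores the parameters of only one layer rather than twelve, which is what makes room for the larger hidden sizes mentioned above.<br />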
<br />
===Inter-sentence coherence loss===<br />
<br />
BERT uses two loss functions, namely the masked language modelling (MLM) loss and the next-sentence prediction (NSP) loss. NSP is a binary classification loss in which positive examples are two consecutive segments from the training corpus and negative examples pair segments from different documents. Negative and positive examples are sampled with equal probability. However, experiments show that NSP is not effective; it should also be pointed out that the NSP loss overlaps with the MLM loss in terms of topic prediction. In fact, the necessity of the NSP loss has been questioned in the literature (Lample and Conneau, 2019; Joshi et al., 2019). The authors explain the reason as follows:<br />
A negative example in NSP is misaligned from both a topic and a coherence perspective. However, topic prediction is easier to learn than coherence prediction, so the model ends up learning just the easier topic-prediction signal. For example, the model can easily learn that "I love cats" and "I had sushi for lunch" are not coherent because they already differ in topic, but it might not be able to tell that "I love cats" and "my mom owned a dog" do not belong next to each other.<br />
They address this problem by introducing a new loss, namely sentence order prediction (SOP), which is again a binary classification loss. Positive examples are the same as in NSP (two consecutive segments), but negative examples are the same two consecutive segments with their order swapped. SOP forces the model to learn the harder coherence-prediction task. The following table compares NSP with SOP. As we can see, a model trained with NSP cannot solve the SOP task (it performs at chance level, 52%), but a model trained with SOP can solve the NSP task to an acceptable degree (78.9%). We also see that on average SOP improves results on downstream tasks by almost 1%. Therefore, the authors use MLM and SOP as the loss functions.<br />
<br />
<br />
<br />
[[File:SOPvsNSP.png | center |800px]]<br />
<br />
<br />
'''What does sentence order prediction (SOP) look like?'''<br />
<br />
'''Through a mathematical lens:'''<br />
<br />
First we define some notation. <math display="inline">\vec{s_{j}}</math> is the <math display="inline">j^{th}</math> textual segment in a document <math display="inline">D</math>, where <math display="inline"> \vec{s_{j}} \in \mathrm{span} \{ \vec{w^{j}_1}, \ldots , \vec{w^{j}_n} \} </math> and <math display="inline"> \vec{w^{j}_i} </math> is the <math display="inline">i^{th}</math> word in <math display="inline">\vec{s_{j}}</math>. The task of SOP is, given <math display="inline">\vec{s_{k}}</math>, to predict whether a following textual segment <math display="inline">\vec{s_{k+1}}</math> truly follows it or not. Here the subscripts <math display="inline">k</math> and <math display="inline">k+1</math> denote the ordering, so the task is to predict whether <math display="inline">\vec{s_{k+1}}</math> is actually <math display="inline">\vec{s_{j+1}}</math> or <math display="inline">\vec{s_{j}}</math>.<br />
<br />
<br />
'''Through a visual lens:'''<br />
<br />
[[File:SOP.PNG | center | 800px]]<br />
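Concretely, here is a minimal sketch of how SOP training pairs could be built from consecutive segments of one document (illustrative only; the actual segmentation and sampling details differ):<br />
<pre>
import random

def make_sop_examples(segments):
    # segments: consecutive textual segments from a single document.
    # Label 1: the pair is in its original order (positive example).
    # Label 0: the same pair with the order swapped (negative example).
    examples = []
    for first, second in zip(segments, segments[1:]):
        if random.random() < 0.5:
            examples.append((first, second, 1))   # original order
        else:
            examples.append((second, first, 0))   # swapped order
    return examples

doc = ["I went to the store.", "I bought some milk.", "Then I walked home."]
print(make_sop_examples(doc))
</pre>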
<br />
===Removing dropout===<br />
The last change the authors applied to BERT is removing dropout. Table 8 below shows the effect of removing dropout. They also observe that the model does not overfit the data even after 1M steps of training. The authors point out that empirical [8] and theoretical [9] evidence suggests that batch normalization in combination with dropout may have harmful results, particularly in convolutional neural networks. They speculate that dropout may be having a similar effect here.<br />
[[File:dropout.png | center |800px]]<br />
<br />
===Effect of Network Depth and Width===<br />
<br />
In Table 11, we can see the effect of increasing the number of layers. In all these settings the hidden layer size is 1024. It appears that increasing the depth of the model yields better and better results until the number of layers reaches 24. However, increasing the depth from 24 to 48 appears to degrade the performance of the model.<br />
<br />
[[File:ALBERT_table11.png | center |800px]]<br />
<br />
Table 12 shows the effect of the width of the model. The accuracy of the model improves until the width of the network reaches 4096; after that, any further increase in width appears to decrease the accuracy of the model.<br />
[[File:ALBERT_table12.png | center |800px]]<br />
<br />
Table 13 investigates whether we need a very deep model when the model is very wide. It seems that when H = 4096, the difference in performance between models with 12 and 24 layers is negligible. <br />
[[File:ALBERT_table13.png | center |800px]]<br />
<br />
These three tables illustrate the logic behind the authors' decisions about the width and depth of the model.<br />
== Source Code ==<br />
<br />
The official source code is available at: https://github.com/google-research/ALBERT<br />
==Conclusion==<br />
From the following table we can see that ALBERT-xxlarge outperforms BERT-large on all the downstream tasks. Note that ALBERT-xxlarge uses a larger configuration (yet fewer parameters) than BERT-large, and as a result it is about 3 times slower.<br />
<br />
[[File:result.png | center |800px]]<br />
<br />
==Critiques==<br />
The authors mentioned that we usually get better results if we train our model for a longer time. Therefore, they present a comparison in which they trained both ALBERT-xxlarge and BERT-large for the same amount of time instead of the same number of steps. Here are the results:<br />
[[File:sameTime.png | center |800px]]<br />
<br />
However, in my opinion, it is not a fair comparison to let ALBERT-xxlarge train for 125K steps and then train BERT-large for 400K steps in the same amount of time, because after a certain number of training steps, additional steps do not improve the result by much. It would be better to look at the results when BERT-large is trained for 125K steps and ALBERT-xxlarge is trained for the same amount of time; I suspect that in that case the result would be in favour of BERT-large. It would also be nice to have a plot with time on the horizontal axis and accuracy on the vertical axis. We would then probably see that BERT-large is better at first, but that at some point ALBERT-xxlarge starts to give higher accuracy.<br />
<br />
This paper proposes an embedding factorization to reduce the number of parameters in the embedding layer, but the authors do not cite or compare against related approaches, even though this kind of dimensionality reduction has been explored with other techniques, for example knowledge distillation, quantization, or adaptive input/softmax.<br />
<br />
==Reference==<br />
[1]: Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.<br />
<br />
[2]: Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in neural information processing systems, pp. 2214–2224, 2017.<br />
<br />
[3]: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.<br />
<br />
[4]: Radu Soricut and Zhenzhong Lan. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. 2019. URL https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html<br />
<br />
[5]: Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning. What Does BERT Look At? An Analysis of BERT's Attention. 2019. URL https://arxiv.org/abs/1906.04341<br />
<br />
[6]: Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. 2019. URL https://arxiv.org/abs/1907.10529<br />
<br />
[7]: Guillaume Lample and Alexis Conneau. Crosslingual language model pretraining. 2019. URL https://arxiv.org/abs/1901.07291<br />
<br />
[8]: Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.<br />
<br />
[9]: Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the disharmony between dropout and batch normalization by variance shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2682–2690, 2019</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE&diff=48629SuperGLUE2020-12-01T03:47:27Z<p>Dmaleki: </p>
<hr />
<div><br />
== Presented by ==<br />
Shikhar Sakhuja<br />
<br />
== Introduction == <br />
Natural Language Processing (NLP) has seen immense improvements over the past two years. The improvements offered by RNN-based models such as ELMo [2] and Transformer-based [1] models such as OpenAI GPT [3] and BERT [4] have revolutionized the field. These models render GLUE [5], the standard benchmark for NLP tasks, ineffective. The GLUE benchmark was released over a year ago and assessed NLP models using a single-number metric that summarized performance over a set of diverse tasks. However, transformer-based models now outperform non-expert humans on several tasks. With transformer-based models achieving near-perfect scores on almost all tasks in GLUE and outperforming humans on some, there is a need for a new benchmark that involves harder and even more diverse language tasks. The authors release SuperGLUE as a new benchmark with a more rigorous set of language understanding tasks.<br />
<br />
== Related Work == <br />
There have been several benchmarks attempting to standardize the field of language understanding tasks. SentEval [6] evaluated fixed-size sentence embeddings for tasks. DecaNLP [7] converts tasks into a general question-answering format. GLUE offers a much more flexible and extensible benchmark since it imposes no restrictions on model architectures or parameter sharing. <br />
<br />
GLUE has been the gold standard for language understanding evaluation since its release. In fact, the benchmark has promoted progress in language modelling, with all the transformer-based models initially aiming to achieve high scores on GLUE. The original GPT and BERT models scored 72.8 and 80.2 on GLUE, respectively. However, the latest GPT and BERT models far outperform these scores, highlighting the need for a more robust and difficult benchmark.<br />
<br />
== Motivation ==<br />
Transformer-based NLP models can be trained using transfer learning, which was previously only seen in computer vision tasks and was notoriously difficult for language because of the discrete nature of words. Transfer learning in NLP allows models to be trained over terabytes of language data in a self-supervised fashion. These models can then be fine-tuned for downstream tasks such as sentiment classification and fake news detection. The fine-tuned models beat many human labelers who were not experts in the domain. This creates a need for a newer, more robust baseline that can stay relevant given the rapid improvements in the field of NLP. <br />
<br />
[[File:loser glue.png]]<br />
<br />
Figure 1: Transformer-based models outperforming humans in GLUE tasks.<br />
<br />
== Improvements to GLUE ==<br />
SuperGLUE follows the design principles of GLUE but seeks to improve on its predecessor in many ways:<br />
<br />
'''More challenging tasks:''' SuperGLUE contains the two hardest tasks in GLUE and open tasks that are difficult to current NLP approaches.<br />
<br />
'''More diverse task formats:''' SuperGLUE expands GLUE task formats to include coreference resolution and question answering.<br />
<br />
'''Comprehensive human baselines:''' Human performance estimates are provided for all benchmark tasks.<br />
<br />
'''Improved code support:''' SuperGLUE is built around widely used tools, including PyTorch and AllenNLP.<br />
<br />
'''Refined usage rules:''' SuperGLUE leaderboard ensures fair competition and full credit to creators.<br />
<br />
== Design Process ==<br />
<br />
SuperGLUE is designed to be widely applicable to many different NLP tasks. That being said, in designing SuperGLUE, certain criteria needed to be established to determine whether an NLP task is suitable for inclusion. The authors specified six such requirements, which are listed below.<br />
<br />
#'''Task substance:''' Tasks should test a system's reasoning and understanding of English text.<br />
#'''Task difficulty:''' Tasks should be solvable by those who graduated from an English postsecondary institution.<br />
#'''Evaluability:''' Tasks are required to have an automated performance metric that aligns with human judgements of output quality.<br />
#'''Public data:''' Tasks need to have existing public data for training with a preference for an additional private test set.<br />
#'''Task format:''' Preference for tasks with simpler input and output formats, to steer users of the benchmark away from task-specific architectures.<br />
#'''License:''' Task data must be under a license that allows the redistribution and use for research.<br />
<br />
To select the tasks included in the benchmark, the authors put out a public call for NLP task proposals and received many submissions. From these, they filtered the tasks according to the criteria above and eliminated any that could not be used due to licensing issues or other problems.<br />
<br />
== SuperGLUE Tasks ==<br />
<br />
SuperGLUE has 8 language understanding tasks. They test a model’s understanding of texts in English. The tasks are designed to be solvable by most college-educated English speakers but to remain beyond the capabilities of most state-of-the-art systems today. <br />
<br />
'''BoolQ''' (Boolean Questions [9]): A QA task consisting of a short passage and questions about the passage that are answered with either yes or no. <br />
<br />
'''CB''' (CommitmentBank [10]): A corpus of short texts in which at least one sentence contains an embedded clause; the task is to judge how committed the author is to the truth of that clause. <br />
<br />
'''COPA''' (Choice of plausible Alternatives [11]): A causal reasoning task in which, given a premise sentence, the system must choose the cause or effect of the sentence from two potential choices. <br />
<br />
'''MultiRC''' (Multi-Sentence Reading Comprehension [12]): A QA task in which, given a passage, a question, and a list of potential answers, the model should label each answer as true or false. The passages are taken from seven domains, including news, fiction, and historical text.<br />
<br />
'''ReCoRD''' (Reading Comprehension with Commonsense Reasoning Dataset [13]): A multiple-choice, question answering task, where given a passage with a masked entity, the model should be able to predict the masked out entity from the available choices. The articles are extracted from CNN and Daily Mail.<br />
<br />
'''RTE''' (Recognizing Textual Entailment [14]): Classifying whether a text can be plausibly inferred from a given passage. <br />
<br />
'''WiC''' (Word in Context [15]): Identifying whether a polysemous word used in multiple sentences is being used with the same sense across sentences or not. <br />
<br />
'''WSC''' (Winograd Schema Challenge [16]): A coreference resolution task in which each sentence contains a pronoun and a list of noun phrases from the sentence. The goal is to identify the noun phrase that the pronoun refers to.<br />
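Separately from the benchmark's own code (see the Source Code section below), the task data can also be loaded through third-party tooling. For example, assuming the Hugging Face "datasets" library is installed, a task such as BoolQ can be loaded as follows (a usage sketch, not part of the SuperGLUE release itself):<br />
<pre>
# Assumes the third-party `datasets` library is installed: pip install datasets
from datasets import load_dataset

# Each SuperGLUE task is exposed as a configuration, e.g. "boolq", "cb", "copa",
# "multirc", "record", "rte", "wic", "wsc".
boolq = load_dataset("super_glue", "boolq")

example = boolq["train"][0]
print(example["question"])   # a yes/no question
print(example["passage"])    # the supporting passage
print(example["label"])      # 0 or 1
</pre>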
<br />
The table below briefly corresponds to the different tasks included in SuperGLUE along with the task type and size of the datasets. In the table, WSD stands for word sense disambiguation, NLI is natural language inference, coref. is coreference resolution, and QA is question answering.<br />
<br />
[[File: supergluetasks.png]]<br />
<br />
In the following chart[18], you can see the differences between the different benchmarks.<br />
[[File: superglue.JPG]]<br />
<br />
<br />
===Scoring===<br />
As with GLUE, the authors seek to give a sense of aggregate system performance over all tasks by averaging the per-task scores. Lacking a fair criterion to weigh the contribution of each task to the overall score, they opt for the simple approach of weighing each task equally; for tasks with multiple metrics, those metrics are first averaged to get a single task score.<br />
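As a concrete illustration of this aggregation (a sketch that assumes all per-task metrics are already on a comparable 0-100 scale; the numbers below are made up):<br />
<pre>
def superglue_style_score(task_metrics):
    # task_metrics: dict mapping task name -> list of metric values for that task.
    # Step 1: average the metrics within each task to get one score per task.
    task_scores = {task: sum(vals) / len(vals) for task, vals in task_metrics.items()}
    # Step 2: weigh every task equally by averaging the per-task scores.
    return sum(task_scores.values()) / len(task_scores)

metrics = {
    "BoolQ": [77.0],         # accuracy only
    "CB":    [85.0, 89.0],   # F1 and accuracy, averaged first
    "COPA":  [70.0],
}
print(superglue_style_score(metrics))  # 78.0
</pre>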
<br />
== Model Analysis ==<br />
SuperGLUE includes two diagnostic datasets for analyzing linguistic knowledge and gender bias in models. To analyze linguistic and world knowledge, submissions to SuperGLUE are required to include their RTE model's sentence-pair predictions (entailment, not_entailment) on this diagnostic set. As for gender bias, SuperGLUE includes the diagnostic dataset Winogender, which measures gender bias in coreference resolution systems. A poor bias score indicates gender bias; however, a good score does not necessarily mean a model is unbiased. This is one limitation of the dataset.<br />
<br />
== Results ==<br />
<br />
Table 1 offers a summary of the results on SuperGLUE across different models. The CBOW baselines are generally close to chance performance. BERT, on the other hand, increases the SuperGLUE score by about 25 points and gives the largest improvement on most tasks, especially MultiRC, ReCoRD, and RTE. WSC is trickier for BERT, potentially owing to the small dataset size. <br />
<br />
BERT++ [8] increases BERT’s performance even further. However, in line with the goal of the benchmark, even the best model still lags well behind human performance. The human results for WiC, MultiRC, RTE, and ReCoRD were already available in [15], [12], [17], and [13] respectively; for the remaining tasks, the authors employed crowdworkers to reannotate a sample of each test set according to the methods used in [17]. These large gaps should be relatively difficult for models to close. The biggest margin is for WSC, at about 35 points, while CB, RTE, BoolQ, and WiC all have margins of roughly 10 points.<br />
<br />
<br />
[[File: 800px-SuperGLUE result.png]]<br />
<br />
Table 1: Baseline performance on SuperGLUE tasks.<br />
<br />
== Source Code ==<br />
<br />
The source code is available at https://github.com/nyu-mll/jiant .<br />
<br />
== Conclusion ==<br />
SuperGLUE fills the gap that GLUE left owing to its inability to keep up with the state of the art in NLP. The new language tasks that the benchmark offers are built to be more robust and difficult for NLP models to solve. With the gap between model and human performance being around 10-35 points across tasks, SuperGLUE should remain relevant for some time before models catch up to it. Overall, this is a significant contribution toward improving general-purpose natural language understanding. <br />
<br />
== Critique == <br />
This is quite a fascinating read where the authors of the gold-standard benchmark have essentially conceded to the progress in NLP. Bowman’s team resorting to creating a new benchmark altogether to keep up with the rapid pace of increase in NLP makes me wonder if these benchmarks are inherently flawed. Applying the idea of Wittgenstein’s Ruler, are we measuring the performance of models using the benchmark, or the quality of benchmarks using the models? <br />
<br />
I’m curious how long SuperGLUE will stay relevant given the advances in NLP. GPT-3, released in June 2020, has outperformed GPT-2 and BERT by a huge margin, given the 100x increase in parameters (175B parameters trained on roughly 600GB of text for GPT-3, compared to 1.5B parameters trained on 40GB for GPT-2). In October 2020, a new deep learning technique (Pattern-Exploiting Training) managed to train a Transformer NLP model with 223M parameters (roughly 0.1% of GPT-3's parameters) and outperformed GPT-3 by 3 points on SuperGLUE. With the field improving so rapidly, I think SuperGLUE is nothing but a bandaid benchmark that will become obsolete in no time.<br />
<br />
== References ==<br />
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.<br />
<br />
[2] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1202. URL https://www.aclweb.org/anthology/N18-1202<br />
<br />
[3] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018. Unpublished ms. available through a link at https://blog.openai.com/language-unsupervised/.<br />
<br />
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https: //arxiv.org/abs/1810.04805.<br />
<br />
[5] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=rJ4km2R5t7.<br />
<br />
[6] Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the 11th Language Resources and Evaluation Conference. European Language Resource Association, 2018. URL https://www.aclweb.org/anthology/L18-1269.<br />
<br />
[7] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information processing Systems (NeurIPS). Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf.<br />
<br />
[8] Jason Phang, Thibault Févry, and Samuel R Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint 1811.01088, 2018. URL https://arxiv.org/abs/1811.01088.<br />
<br />
[9] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936,2019a.<br />
<br />
[10] Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank: Investigating projection in naturally occurring discourse. 2019. To appear in Proceedings of Sinn und Bedeutung 23. Data can be found at https://github.com/mcdm/CommitmentBank/.<br />
<br />
[11] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.<br />
<br />
[12] Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language technologies (NAACL-HLT). Association for Computational Linguistics, 2018. URL https://www.aclweb.org/anthology/papers/N/N18/N18-1023/.<br />
<br />
[13] Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint 1810.12885, 2018.<br />
<br />
[14] Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment. Springer, 2006. URL https://link.springer.com/chapter/10.1007/11736790_9.<br />
<br />
[15] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1808.09121.<br />
<br />
[16] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012. URL http://dl.acm.org/citation.cfm?id=3031843.3031909.<br />
<br />
[17] Nikita Nangia and Samuel R. Bowman. Human vs. Muppet: A conservative estimate of human performance on the GLUE benchmark. In Proceedings of the Association of Computational Linguistics (ACL). Association for Computational Linguistics, 2019. URL https://woollysocks.github.io/assets/GLUE_Human_Baseline.pdf.<br />
<br />
[18] Storks, Shane, Qiaozi Gao, and Joyce Y. Chai. "Recent advances in natural language inference: A survey of benchmarks, resources, and approaches." arXiv preprint arXiv:1904.01172 (2019).</div>Dmaleki
<br />
[13] Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint 1810.12885, 2018.<br />
<br />
[14] Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment. Springer, 2006. URL https://link.springer.com/chapter/10.1007/11736790_9.<br />
<br />
[15] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1808.09121.<br />
<br />
[16] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012. URL http://dl.acm.org/citation.cfm?id=3031843.3031909.<br />
<br />
[17] Nikita Nangia and Samuel R. Bowman. Human vs. Muppet: A conservative estimate of human performance on the GLUE benchmark. In Proceedings of the Association of Computational Linguistics (ACL). Association for Computational Linguistics, 2019. URL https://woollysocks.github.io/assets/GLUE_Human_Baseline.pdf.</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE&diff=48624SuperGLUE2020-12-01T03:44:28Z<p>Dmaleki: /* SuperGLUE Tasks */</p>
<hr />
<div><br />
== Presented by ==<br />
Shikhar Sakhuja<br />
<br />
== Introduction == <br />
Natural Language Processing (NLP) has seen immense improvements over the past two years. The gains offered by RNN-based models such as ELMo [2], and Transformer-based [1] models such as OpenAI GPT [3] and BERT [4], have revolutionized the field. These models render GLUE [5], the standard benchmark for NLP tasks, ineffective. The GLUE benchmark was released over a year ago and assessed NLP models using a single-number metric that summarized performance over a set of diverse tasks. However, transformer-based models now achieve near-perfect scores on almost all GLUE tasks and outperform non-expert humans on several of them, so there is a need for a new benchmark with harder and even more diverse language tasks. The authors release SuperGLUE as a new benchmark with a more rigorous set of language understanding tasks.<br />
<br />
== Related Work == <br />
There have been several benchmarks attempting to standardize the evaluation of language understanding tasks. SentEval [6] evaluates fixed-size sentence embeddings on a suite of downstream tasks. DecaNLP [7] converts tasks into a general question-answering format. GLUE offers a much more flexible and extensible benchmark since it imposes no restrictions on model architectures or parameter sharing. <br />
<br />
GLUE has been the gold standard for language understanding evaluation since its release. In fact, the benchmark has driven progress in language modeling, with transformer-based models competing to achieve high scores on GLUE. The original GPT and BERT models scored 72.8 and 80.2 on GLUE, respectively; the latest GPT and BERT variants far outperform these scores, underscoring the need for a more robust and difficult benchmark.<br />
<br />
== Motivation ==<br />
Transformer-based NLP models can be trained with transfer learning, which was previously seen mostly in computer vision and was notoriously difficult for language because of the discrete nature of words. Transfer learning in NLP allows models to be trained over terabytes of language data in a self-supervised fashion. These models can then be fine-tuned for downstream tasks such as sentiment classification and fake news detection. The fine-tuned models beat many human labelers who weren’t experts in the domain. This creates a need for a newer, more robust benchmark that can stay relevant amid the rapid improvements in the field of NLP. <br />
<br />
[[File:loser glue.png]]<br />
<br />
Figure 1: Transformer-based models outperforming humans in GLUE tasks.<br />
<br />
== Improvements to GLUE ==<br />
SuperGLUE follows the design principles of GLUE but seeks to improve on its predecessor in many ways:<br />
<br />
'''More challenging tasks:''' SuperGLUE retains the two hardest tasks in GLUE and adds new tasks that are difficult for current NLP approaches.<br />
<br />
'''More diverse task formats:''' SuperGLUE expands GLUE task formats to include coreference resolution and question answering.<br />
<br />
'''Comprehensive human baselines:''' Human performance estimates are provided for all benchmark tasks.<br />
<br />
'''Improved code support:''' SuperGLUE is built around widely used tools, including PyTorch and AllenNLP.<br />
<br />
'''Refined usage rules:''' SuperGLUE leaderboard ensures fair competition and full credit to creators.<br />
<br />
== Design Process ==<br />
<br />
SuperGLUE is designed to be widely applicable to many different NLP tasks. That being said, in designing SuperGLUE, certain criteria needed to be established to determine whether an NLP task is suitable for inclusion in the benchmark. The authors specified six such requirements, which are listed below.<br />
<br />
#'''Task substance:''' Tasks should test a system's reasoning and understanding of English text.<br />
#'''Task difficulty:''' Tasks should be beyond the reach of current state-of-the-art systems but solvable by most college-educated English speakers.<br />
#'''Evaluability:''' Tasks are required to have an automated performance metric that corresponds well with human judgments of output quality.<br />
#'''Public data:''' Tasks need to have existing public data for training with a preference for an additional private test set.<br />
#'''Task format:''' Preference for tasks with simpler input and output formats, to steer users of the benchmark away from task-specific architectures.<br />
#'''License:''' Task data must be under a license that allows the redistribution and use for research.<br />
<br />
To select tasks for inclusion in the benchmark, the authors issued a public call for NLP task proposals and received many submissions. They then filtered the proposals according to the criteria above, eliminating any tasks that could not be used due to licensing issues or other problems.<br />
<br />
== SuperGLUE Tasks ==<br />
<br />
SuperGLUE has 8 language understanding tasks that test a model’s understanding of English text. The tasks are built to be solvable by most college-educated English speakers while remaining beyond the capabilities of most state-of-the-art systems today. <br />
<br />
'''BoolQ''' (Boolean Questions [9]): QA task consisting of a short passage and a related question that must be answered with yes or no. <br />
<br />
'''CB''' (CommitmentBank [10]): Corpus of short texts in which at least one sentence contains an embedded clause; the task is to judge how committed the author is to the truth of that clause, framed as a textual-entailment-style problem. <br />
<br />
'''COPA''' (Choice of Plausible Alternatives [11]): Causal reasoning task in which, given a premise sentence, the system must choose the more plausible cause or effect from two candidate choices. <br />
<br />
'''MultiRC''' (Multi-Sentence Reading Comprehension [12]): QA task in which, given a passage, a question, and a set of candidate answers, the model must label each candidate answer as true or false. Passages are drawn from seven domains, including news, fiction, and historical text.<br />
<br />
'''ReCoRD''' (Reading Comprehension with Commonsense Reasoning Dataset [13]): A multiple-choice question-answering task in which, given a passage and a cloze-style query with a masked entity, the model must predict the masked entity from the available choices. The articles are extracted from CNN and Daily Mail.<br />
<br />
'''RTE''' (Recognizing Textual Entailment [14]): Classifying whether a text can be plausibly inferred from a given passage. <br />
<br />
'''WiC''' (Word in Context [15]): Identifying whether a polysemous word used in multiple sentences is being used with the same sense across sentences or not. <br />
<br />
'''WSC''' (Winograd Schema Challenge [16]): A coreference resolution task where each sentence includes a pronoun and a list of noun phrases from the sentence. The goal is to identify the noun phrase that the pronoun refers to.<br />
<br />
The table below briefly corresponds to the different tasks included in SuperGLUE along with the task type and size of the datasets. In the table, WSD stands for word sense disambiguation, NLI is natural language inference, coref. is coreference resolution, and QA is question answering.<br />
<br />
[[File: supergluetasks.png]]<br />
<br />
The following chart [10] summarizes the differences between the benchmarks.<br />
[[File: superglue.JPG]]<br />
<br />
<br />
===Scoring===<br />
As with GLUE, the authors seek to give a sense of aggregate system performance over all tasks by averaging the scores of all tasks. Lacking a fair criterion with which to weight the contribution of each task to the overall score, they opt for the simple approach of weighting each task equally; for tasks with multiple metrics, those metrics are first averaged to get a single task score.<br />
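<br />
As a rough illustration of this aggregation scheme (not the official evaluation code; the task names and metric values below are placeholders), the overall score could be computed along these lines:<br />
<pre>
# Sketch of the SuperGLUE aggregation: metrics within a task are averaged
# first, then per-task scores are averaged with equal weight.
# The task names and numbers here are placeholders, not real results.
per_task_metrics = {
    "BoolQ": {"accuracy": 75.0},
    "CB": {"f1": 70.0, "accuracy": 80.0},   # multiple metrics -> averaged first
    "COPA": {"accuracy": 65.0},
    "RTE": {"accuracy": 70.0},
}

def superglue_score(per_task_metrics):
    task_scores = [sum(m.values()) / len(m) for m in per_task_metrics.values()]
    return sum(task_scores) / len(task_scores)   # equal weight per task

print(round(superglue_score(per_task_metrics), 1))
</pre>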
<br />
== Model Analysis ==<br />
SuperGLUE includes two diagnostic tasks for analyzing linguistic knowledge and gender bias in models. To analyze linguistic and world knowledge, submissions to SuperGLUE are required to include predictions of the sentence-pair relation (entailment, not_entailment) on a diagnostic set cast in the RTE format. As for gender bias, SuperGLUE includes the diagnostic dataset Winogender, which measures gender bias in coreference resolution systems. A poor bias score indicates gender bias; however, a good score does not necessarily mean a model is unbiased, which is one limitation of the dataset.<br />
<br />
== Results ==<br />
<br />
Table 1 summarizes the results on SuperGLUE across different models. The CBOW baseline performs close to chance on most tasks. BERT, on the other hand, increases the SuperGLUE score by roughly 25 points and yields the largest improvements on most tasks, especially MultiRC, ReCoRD, and RTE. WSC is trickier for BERT, potentially owing to its small dataset size. <br />
<br />
BERT++ [8] improves on BERT’s performance even further. However, in line with the goal of the benchmark, the best model still lags well behind human performance. Human results for WiC, MultiRC, RTE, and ReCoRD were already available from [15], [12], [17], and [13] respectively; for the remaining tasks, the authors employed crowdworkers to reannotate a sample of each test set following the methods used in [17]. These large gaps should be difficult for models to close. The biggest margin is for WSC at roughly 35 points, while CB, RTE, BoolQ, and WiC all have margins of roughly 10 points.<br />
<br />
<br />
[[File: 800px-SuperGLUE result.png]]<br />
<br />
Table 1: Baseline performance on SuperGLUE tasks.<br />
<br />
== Source Code ==<br />
<br />
The source code is available at https://github.com/nyu-mll/jiant .<br />
<br />
== Conclusion ==<br />
SuperGLUE fills the gap left by GLUE, which can no longer keep up with the state of the art in NLP. The new language tasks that the benchmark offers are built to be more robust and more difficult for NLP models to solve. With models still trailing human performance by roughly 10-35 points across tasks, SuperGLUE should remain a meaningful target for some time before models catch up to it. Overall, this is a significant contribution toward improving general-purpose natural language understanding. <br />
<br />
== Critique == <br />
This is quite a fascinating read in which the authors of the gold-standard benchmark have essentially conceded to the progress in NLP. Bowman’s team resorting to creating a new benchmark altogether to keep up with the rapid pace of progress in NLP makes me wonder whether these benchmarks are inherently flawed. Applying the idea of Wittgenstein’s Ruler: are we measuring the performance of models using the benchmark, or the quality of benchmarks using the models? <br />
<br />
I’m curious how long SuperGLUE will stay relevant given the pace of advances in NLP. GPT-3, released in June 2020, outperformed GPT-2 and BERT by a huge margin, aided by a roughly 100x increase in parameters (175B parameters trained on ~600GB of data for GPT-3, compared to 1.5B parameters over 40GB for GPT-2). In October 2020, a new training technique (Pattern-Exploiting Training) produced a Transformer NLP model with 223M parameters (roughly 0.1% of GPT-3’s parameters) that outperformed GPT-3 by 3 points on SuperGLUE. With the field improving so rapidly, I think SuperGLUE is nothing but a band-aid benchmark that will become obsolete in no time.<br />
<br />
== References ==<br />
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.<br />
<br />
[2] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1202. URL https://www.aclweb.org/anthology/N18-1202<br />
<br />
[3] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018. Unpublished ms. available through a link at https://blog.openai.com/language-unsupervised/.<br />
<br />
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https: //arxiv.org/abs/1810.04805.<br />
<br />
[5] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=rJ4km2R5t7.<br />
<br />
[6] Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the 11th Language Resources and Evaluation Conference. European Language Resource Association, 2018. URL https://www.aclweb.org/anthology/L18-1269.<br />
<br />
[7] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information processing Systems (NeurIPS). Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf.<br />
<br />
[8] Jason Phang, Thibault Févry, and Samuel R Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint 1811.01088, 2018. URL https://arxiv.org/abs/1811.01088.<br />
<br />
[9] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936,2019a.<br />
<br />
[10] Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank: Investigating projection in naturally occurring discourse. 2019. To appear in Proceedings of Sinn und Bedeutung 23. Data can be found at https://github.com/mcdm/CommitmentBank/.<br />
<br />
[11] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.<br />
<br />
[12] Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language technologies (NAACL-HLT). Association for Computational Linguistics, 2018. URL https://www.aclweb.org/anthology/papers/N/N18/N18-1023/.<br />
<br />
[13] Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint 1810.12885, 2018.<br />
<br />
[14] Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment. Springer, 2006. URL https://link.springer.com/chapter/10.1007/11736790_9.<br />
<br />
[15] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1808.09121.<br />
<br />
[16] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012. URL http://dl.acm.org/citation.cfm?id=3031843.3031909.<br />
<br />
[17] Nikita Nangia and Samuel R. Bowman. Human vs. Muppet: A conservative estimate of human performance on the GLUE benchmark. In Proceedings of the Association of Computational Linguistics (ACL). Association for Computational Linguistics, 2019. URL https://woollysocks.github.io/assets/GLUE_Human_Baseline.pdf.</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:superglue.JPG&diff=48623File:superglue.JPG2020-12-01T03:42:14Z<p>Dmaleki: </p>
<hr />
<div></div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT&diff=48596BERTScore: Evaluating Text Generation with BERT2020-12-01T02:40:27Z<p>Dmaleki: </p>
<hr />
<div>== Presented by == <br />
Gursimran Singh<br />
<br />
== Introduction == <br />
In recent times, various machine learning approaches for text generation have gained popularity. This paper aims to develop an automatic metric that judges the quality of generated text. Commonly used state-of-the-art metrics use either an n-gram approach or word embeddings to calculate the similarity between the reference and the candidate sentence. BERTScore, on the other hand, calculates the similarity using contextual embeddings. BERTScore addresses two common pitfalls of n-gram-based metrics. Firstly, n-gram metrics fail to robustly match paraphrases, which leads to performance underestimation when semantically correct phrases are penalized because they differ from the surface form of the reference. In BERTScore, by contrast, the similarity is computed using contextualized token embeddings, which have been shown to be effective for paraphrase detection. Secondly, n-gram metrics fail to capture distant dependencies and penalize semantically critical ordering changes, whereas contextualized embeddings capture distant dependencies and ordering effectively. The authors carry out various experiments in Machine Translation and Image Captioning to show why BERTScore is more reliable and robust than previous approaches.<br />
<br />
''' Word versus Context Embeddings '''<br />
<br />
Both approaches aim to reduce the sparseness of a bag-of-words (BoW) representation of text, which arises from high-dimensional vocabularies. Both create embeddings of much lower dimensionality than sparse BoW vectors and aim to capture semantics and context. Word embeddings differ in that they are static: a word embedding model always produces the same embedding for a given word, regardless of the surrounding words. Contextual embeddings, however, create different embeddings for a word depending on the surrounding words in the given text.<br />
<br />
== Previous Work ==<br />
Previous Approaches for evaluating text generation can be broadly divided into various categories. The commonly used techniques for text evaluation are based on n-gram matching. The main objective here is to compare the n-grams in reference and candidate sentences and thus analyze the ordering of words in the sentences. <br />
The most popular n-Gram Matching metric is BLEU. It follows the underlying principle of n-Gram matching and its uniqueness comes from three main factors. <br><br />
• Each n-Gram is matched at most once. <br><br />
• The total of exact-matches is accumulated for all reference candidate pairs and divided by the total number of <math>n</math>-grams in all candidate sentences. <br><br />
• Very short candidates are restricted. <br><br />
<br />
Further BLEU is generally calculated for multiple <math>n</math>-grams and averaged geometrically.<br />
n-Gram approaches also include METEOR, NIST, ΔBLEU, etc.<br />
<br />
Most of these methods utilize or slightly modify the exact match precision (Exact-<math>P_n</math>) and recall (Exact-<math>R_n</math>) scores. These scores can be formalized as follows:<br />
<br />
Exact-<math> P_n = \frac{\sum_{w \in S^{n}_{\hat{x}}} \mathbb{I}[w \in S^{n}_{x}]}{|S^{n}_{\hat{x}}|} </math> <br />
<br />
Exact-<math> R_n = \frac{\sum_{w \in S^{n}_{x}} \mathbb{I}[w \in S^{n}_{\hat{x}}]}{|S^{n}_{x}|} </math> <br />
<br />
Here <math>S^{n}_{x}</math> and <math>S^{n}_{\hat{x}}</math> are lists of token <math>n</math>-grams in the reference <math>x</math> and candidate <math>\hat{x}</math> sentences respectively.<br />
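<br />
For concreteness, a minimal sketch of these exact-match scores for whitespace-tokenized sentences (an illustration, not a reference implementation) might look as follows:<br />
<pre>
# Minimal sketch of Exact-P_n and Exact-R_n for whitespace-tokenized sentences.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def exact_precision_recall(reference, candidate, n=1):
    ref_ngrams = ngrams(reference.split(), n)    # S^n_x
    cand_ngrams = ngrams(candidate.split(), n)   # S^n_{x_hat}
    ref_set, cand_set = set(ref_ngrams), set(cand_ngrams)
    precision = sum(w in ref_set for w in cand_ngrams) / len(cand_ngrams)
    recall = sum(w in cand_set for w in ref_ngrams) / len(ref_ngrams)
    return precision, recall

print(exact_precision_recall("people like foreign cars",
                             "consumers prefer imported cars"))
</pre>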
<br />
Other categories include Edit-distance-based Metrics, Embedding-based metrics, and Learned Metrics. Most of these techniques do not capture the context of a word in the sentence. Moreover, Learned Metric approaches also require costly human judgments as supervision for each dataset.<br />
<br />
== Motivation ==<br />
The <math>n</math>-gram approaches like BLEU do not capture the positioning and the context of the word and simply rely on exact matching for evaluation. Consider the following example that shows how BLEU can result in incorrect judgment. <br><br />
Reference: people like foreign cars <br><br />
Candidate 1: people like visiting places abroad <br><br />
Candidate 2: consumers prefer imported cars<br />
<br />
BLEU gives a higher score to Candidate 1 than to Candidate 2. This undermines the evaluation of text generation models, since semantically correct sentences are penalized while semantically different phrases are scored higher just because they are closer to the surface form of the reference sentence. <br />
<br />
On the other hand, BERTScore computes similarity using contextual token embeddings. It helps in detecting semantically correct paraphrased sentences. It also captures cause-and-effect ordering (e.g., "A gives B" versus "B gives A"), which the BLEU score does not detect.<br />
<br />
== BERTScore Architecture ==<br />
Fig 1 summarizes the steps for calculating the BERTScore. Next, we will see details about each step. Here, the reference sentence is given by <math> x = \langle x_1, \ldots, x_k \rangle </math> and the candidate sentence by <math> \hat{x} = \langle \hat{x}_1, \ldots, \hat{x}_l \rangle </math>. <br><br />
<br />
<div align="center"> [[File:Architecture_BERTScore.PNG|Illustration of the computation of BERTScore.]] </div><br />
<div align="center">'''Fig 1'''</div><br />
<br />
=== Token Representation ===<br />
Reference and candidate sentences are represented using contextual embeddings. This is inspired by word embedding techniques, but in contrast to word embeddings, the contextual embedding of a word depends on the surrounding words in the sentence. These contextual embeddings are calculated using BERT and other similar models, which utilize self-attention and nonlinear transformations.<br />
<br />
<div align="center"> [[File:Pearsson_corr_contextual_emb.PNG|Pearson Correlation for Contextual Embedding]] </div><br />
<div align="center">'''Fig 2'''</div><br />
<br />
=== Cosine Similarity ===<br />
Pairwise cosine similarity is calculated between each token <math> x_{i} </math> in the reference sentence and each token <math> \hat{x}_{j} </math> in the candidate sentence. Pre-normalized vectors are used, so the pairwise similarity reduces to the inner product <math> x_{i}^{\top} \hat{x}_{j} </math>.<br />
<br />
=== BERTScore ===<br />
<br />
Each token in x is matched to the most similar token in <math> \hat{x} </math> and vice-versa for calculating Recall and Precision respectively. The matching is greedy and isolated. Precision and Recall are combined for calculating the F1 score. The equations for calculating Precision, Recall, and F1 Score are as follows<br />
<br />
<div align="center"> [[File:Equations.PNG|Equations for the calculation of BERTScore.]] </div><br />
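<br />
A simplified sketch of this greedy matching, starting from precomputed and already-normalized contextual token embeddings (obtaining the embeddings from a model such as BERT is omitted here), could look like this:<br />
<pre>
import numpy as np

# Sketch of BERTScore's greedy matching (unweighted version, without idf).
# ref_emb:  (k, d) normalized embeddings of the reference tokens
# cand_emb: (l, d) normalized embeddings of the candidate tokens
def bert_score(ref_emb, cand_emb):
    sim = ref_emb @ cand_emb.T            # pairwise cosine similarities (k x l)
    recall = sim.max(axis=1).mean()       # each reference token -> best candidate match
    precision = sim.max(axis=0).mean()    # each candidate token -> best reference match
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
</pre>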
<br />
<br />
=== Importance Weighting (optional) ===<br />
In some cases, rare words can be highly indicative of sentence similarity. Therefore, inverse document frequency (idf) weights can be used in the BERTScore equations above. This is optional; depending on the domain of the text and the available data, it may or may not benefit the final results, so importance weighting remains an open area of research.<br />
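<br />
As a rough sketch (leaving out any smoothing for unseen words), the idf weights over a reference corpus could be computed as follows:<br />
<pre>
import math

# Rough sketch of idf weights computed from the reference corpus;
# smoothing for unseen words is omitted here.
def idf_weights(reference_sentences):
    M = len(reference_sentences)
    doc_freq = {}
    for sent in reference_sentences:
        for w in set(sent.split()):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    return {w: -math.log(df / M) for w, df in doc_freq.items()}
</pre>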
<br />
=== Baseline Rescaling ===<br />
Rescaling is done only to increase the human readability of the score. In theory, cosine similarity values lie between -1 and 1, but in practice they are confined to a much smaller range. A value b, computed using Common Crawl monolingual datasets, is used to linearly rescale the BERTScore. The rescaled recall <math> \hat{R}_{BERT} </math> is given by<br />
<div align="center"> [[File:Equation2.PNG|Equation for the rescaled BERTScore.]] </div><br />
Similarly, <math> P_{BERT} </math> and <math> F_{BERT} </math> are rescaled as well.<br />
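<br />
Since the rescaling equation above is shown only as an image, here is the same linear rescaling written out as a small sketch (b is the empirical baseline value estimated from Common Crawl):<br />
<pre>
# Linear rescaling of a BERTScore value using an empirical baseline b,
# so that typical scores spread over a more readable range.
def rescale(score, b):
    return (score - b) / (1 - b)
</pre>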
<br />
== Experiment & Results ==<br />
The authors have experimented with different pre-trained contextual embedding models, such as BERT and RoBERTa, and report results for the best-performing model. The evaluation is done on Machine Translation and Image Captioning tasks. <br />
<br />
=== Machine Translation ===<br />
The metric evaluation dataset consists of 149 translation systems, gold references, and two types of human judgments, namely, Segment-level human judgments and System-level human judgments. The former assigns a score to each reference candidate pair and the latter associates a single score for the whole system. Segment-level outputs for BERTScore are calculated as explained in the previous section on architecture and the System-level outputs are calculated by taking an average of BERTScore for every reference-candidate pair. Absolute Pearson Correlation <math> \lvert \rho \rvert </math> and Kendall rank correlation <math> \tau </math> are used for calculating metric quality, Williams test <sup> [1] </sup> for significance of <math> \lvert \rho \rvert </math> and Graham & Baldwin <sup> [2] </sup> methods for calculating the bootstrap resampling of <math> \tau </math>. The authors have also created hybrid systems by randomly sampling one candidate sentence for each reference sentence from one of the systems. This increases the volume of systems for System-level experiments. Further, the authors have also randomly selected 100 systems out of 10k hybrid systems for ranking them using automatic metrics. They have repeated this process multiple times and generated Hits@1, which contains the percentage of the metric ranking agreeing with human ranking on the best system. <br />
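<br />
A sketch of the Hits@1 procedure described above, with hypothetical per-system score lookups <code>metric_score</code> and <code>human_score</code>, might be:<br />
<pre>
import random

# Sketch of Hits@1: repeatedly sample 100 hybrid systems and check whether
# the metric's top-ranked system agrees with the human top-ranked system.
# `metric_score` and `human_score` are hypothetical callables that return a
# system-level score for a given system.
def hits_at_1(systems, metric_score, human_score, trials=1000, sample_size=100):
    hits = 0
    for _ in range(trials):
        sample = random.sample(systems, sample_size)
        best_by_metric = max(sample, key=metric_score)
        best_by_human = max(sample, key=human_score)
        hits += int(best_by_metric == best_by_human)
    return 100.0 * hits / trials
</pre>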
<br />
<div align="center"> '''The following 4 tables show the result of the experiments mentioned above.''' </div> <br><br />
<br />
<div align="center"> [[File:Table1_BERTScore.PNG|700px| Table1 Machine Translation]] [[File:Table2_BERTScore.PNG|700px| Table2 Machine Translation]] </div><br />
<div align="center"> [[File:Table3_BERTScore.PNG|700px| Table3 Machine Translation]] [[File:Table4_BERTScore.PNG|700px| Table4 Machine Translation]] </div><br />
<br />
In all 4 tables, we can see that BERTScore is consistently a top performer. It also gives a large improvement over the current state-of-the-art BLEU score. In to-English translation, RUSE shows competitive results but it is a learned metric technique and requires costly human judgments as supervision.<br />
<br />
=== Image Captioning ===<br />
For Image Captioning, human judgments for 12 submission entries from the COCO 2015 Captioning Challenge are used. Following Cui et al. (2018) <sup> [3] </sup>, Pearson correlation with two system-level metrics is calculated: the percentage of captions rated better than or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). Each image has approximately 5 reference captions, and the BERTScore of a candidate caption is taken to be the maximum BERTScore over the individual references. BERTScore is compared with 8 task-agnostic metrics and 2 task-specific metrics. <br />
<br />
<div align="center"> [[File:Table5_BERTScore.PNG|450px| Table5 Image Captioning]] </div><br />
<br />
<div align="center"> '''Table 5: Pearson correlation on the 2015 COCO Captioning Challenge.''' </div><br />
<br />
BERTScore is again a top performer and n-gram metrics like BLEU show a weak correlation with human judgments. For this task, importance weighting shows significant improvement depicting the importance of content words. <br />
<br />
'''Speed:''' The time taken to calculate BERTScore is not significantly higher than for BLEU. For example, with the same hardware, the Machine Translation evaluation takes 15.6 seconds with BERTScore compared to 5.4 seconds with BLEU. Both times are small in absolute terms, so the difference is marginal in practice.<br />
<br />
== Source Code == <br />
The code for this paper is available at [https://github.com/Tiiiger/bert_score BERTScore].<br />
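<br />
A typical usage of the released package looks roughly like the following sketch (assuming the package is installed as <code>bert-score</code>; see the repository README for the authoritative interface and options):<br />
<pre>
# Rough usage sketch of the released bert-score package; see the repository
# README for the exact interface and available options.
from bert_score import score

candidates = ["consumers prefer imported cars"]
references = ["people like foreign cars"]

P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(F1.mean().item())
</pre>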
<br />
== Critique & Future Prospects==<br />
A text evaluation metric, BERTScore, is proposed that outperforms previous approaches because of its use of contextual embeddings for evaluation. It is simple and easy to use. BERTScore is also more robust than previous approaches, as shown by experiments carried out on datasets consisting of paraphrased sentences. There are variants of BERTScore depending on the contextual embedding model, the use of importance weighting, and the evaluation metric reported (precision, recall, or F1 score). <br />
<br />
The main reason behind the success of BERTScore is the use of contextual embeddings; the rest of the architecture is straightforward. Some word-embedding-based metrics use more sophisticated similarity measures, and combining those measures with contextual embeddings instead of word embeddings might yield even more reliable performance than BERTScore.<br />
<br />
<br />
The paper is quite interesting, but the proposed approach offers limited technical novelty. The method is a natural application of BERT combined with traditional cosine similarity measures, precision/recall/F1-based computations, and simple idf-based importance weighting.<br />
<br />
== References ==<br />
<br />
[1] Evan James Williams. Regression analysis. wiley, 1959.<br />
<br />
[2] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In EMNLP, 2014.<br />
<br />
[3] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. Learning to evaluate image captioning. In CVPR, 2018.<br />
<br />
[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.<br />
<br />
[5] Qingsong Ma, Ondrej Bojar, and Yvette Graham. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In WMT, 2018.<br />
<br />
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.<br />
<br />
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019b.</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_Representations&diff=48491Self-Supervised Learning of Pretext-Invariant Representations2020-11-30T19:25:18Z<p>Dmaleki: /* Critiques */</p>
<hr />
<div>==Authors==<br />
<br />
Ishan Misra, Laurens van der Maaten<br />
<br />
== Presented by == <br />
Sina Farsangi<br />
<br />
== Introduction == <br />
<br />
Modern image recognition and object detection systems find image representations using large amounts of data with pre-defined semantic annotations. Some examples of these annotations are class labels [1] and bounding boxes [2], as shown in Figure 1. Finding representations using pre-defined semantic annotations requires a large amount of labeled data, which is not available in all scenarios. Also, these systems usually learn features specific to a particular set of classes rather than semantically meaningful features that can generalize to other domains and classes. '''In other words, pre-defined semantic annotations scale poorly to the long tail of visual concepts''' [3]. Therefore, there has been great interest in the community in finding image representations that are more visually meaningful and can help in several tasks such as image recognition and object detection. One of the fast-growing areas of research that tries to address this problem is '''Self-Supervised Learning'''. Self-Supervised Learning tries to learn deep models that find image representations from the pixels themselves rather than from pre-defined semantic annotations. As we will see, there is no need for class labels or bounding boxes in self-supervised learning. <br />
<br />
[[File: SSL_1.JPG | 800px | center]]<br />
<div align="center">'''Figure 1:''' Semantic Annotations used for finding image representations: a) Class labels and b) Bounding Boxes </div><br />
<br />
Self-Supervised Learning is often done using a set of tasks called '''Pretext tasks'''. During these tasks, a transformation <math> \tau </math> is applied to unlabeled images <math> I </math> to obtain a set of transformed images, <math> I^{t} </math>. Then, a deep neural network, <math> \phi(\theta) </math>, is trained to predict a characteristic of the transformation. Several pretext tasks exist depending on the type of transformation used. Two of the most common pretext tasks are rotation and jigsaw puzzles [4,5,6]. As shown in Figure 2, in the rotation task, unlabeled images <math> I </math> are rotated by a randomly chosen angle (0, 90, 180, or 270 degrees) and the deep network learns to predict the rotation angle. In the jigsaw task, which is more complicated than the rotation task, unlabeled images are cropped into 9 patches and the image is then perturbed by randomly permuting the nine patches. Each permutation falls into one of the 35 classes according to a formula, and a deep network is trained to predict the class of the permutation of the patches in the perturbed image. Some other tasks include colorization, where the model tries to recover the colors of an image that has been converted to greyscale, and image reconstruction, where a square chunk of the image is deleted and the model tries to reconstruct that part. <br />
<br />
[[File: SSL_2.JPG |1000px | center]]<br />
<div align="center">'''Figure 2:''' Self-Supervised Learning using Rotation and Jigsaw Pretext Tasks </div><br />
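<br />
As a concrete illustration of a pretext task, a rotation-prediction training pair could be generated along the following lines (a sketch, not the authors' code):<br />
<pre>
import random
import numpy as np

# Sketch of generating a rotation pretext example: rotate an image
# (H x W x C array) by a random multiple of 90 degrees and use the
# rotation index as the classification target.
def rotation_pretext_example(image):
    k = random.randint(0, 3)                   # 0, 90, 180, or 270 degrees
    rotated = np.rot90(image, k=k, axes=(0, 1))
    return rotated, k                          # (transformed image, class label)
</pre>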
<br />
Although the proposed pretext tasks have obtained promising results, they have the disadvantage of being covariant with the applied transformation. In other words, as deep networks are trained to predict transformation characteristics, they also learn representations that vary with the applied transformation. Intuitively, we would like to obtain representations that are common between the original images and the transformed ones. This idea is supported by the fact that humans are able to recognize these transformed images. This suggests developing a method that obtains image representations shared between the original and transformed images, in other words, image representations that are transformation invariant. The summarized paper addresses this problem by introducing '''Pretext-Invariant Representation Learning''' (PIRL), which learns self-supervised image representations that, as opposed to standard pretext tasks, are transformation invariant and therefore more semantically meaningful. The performance of the proposed method is evaluated on several self-supervised learning benchmarks. The results show that PIRL sets a new state of the art in Self-Supervised Learning by learning transformation-invariant representations.<br />
<br />
== Problem Formulation and Methodology ==<br />
<br />
[[File: SSL_3.JPG | 800px | center]]<br />
<div align="center">'''Figure 3:''' Figure 3: Overview of Standard Pretext Learning and Pretext-Invariant Representation Learning (PIRL). </div><br />
<br />
<br />
An overview of the proposed method and a comparison with standard pretext tasks are shown in Figure 3. For a given image <math>I</math> in the dataset of unlabeled images, <math> D=\{I_1,I_2,...,I_{|D|}\} </math>, a transformation <math> \tau </math> is applied: <br />
<br />
\begin{align} \tag{1} \label{eqn:1}<br />
I^t=\tau(I)<br />
\end{align}<br />
<br />
Where <math>I^t</math> is the transformed image. We would like to train a convolutional neural network, <math>\phi(\theta)</math>, that constructs image representations <math>v_{I}=\phi_{\theta}(I)</math>. Pretext Task based methods learn to predict transformation characteristics, <math>z(t)</math>, by minimizing a transformation covariant loss function in the form of:<br />
<br />
\begin{align} \tag{2} \label{eqn:2}<br />
l_{\text{cov}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,z(t))<br />
\end{align}<br />
<br />
As can be seen, the loss function covaries with the applied transformation and therefore the obtained representations may not be semantically meaningful. PIRL tries to solve this problem as shown in Figure 3. The original and transformed images are passed through two parallel convolutional neural networks to obtain two sets of representations, <math>v(I)</math> and <math>v(I^t)</math>. Then, a contrastive loss function is defined to ensure that the representations of the original and transformed images are similar to each other. The transformation invariant loss function can be defined as:<br />
<br />
\begin{align} \tag{3} \label{eqn:3}<br />
l_{\text{inv}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,v_{I^t})<br />
\end{align}<br />
<br />
Where L is a contrastive loss based on Noise Contrastive Estimators (NCE). The NCE function can be shown as below: <br />
<br />
\begin{align} \tag{4} \label{eqn:4}<br />
h(v_I,v_{I^t})=\frac{\exp \biggl( \frac{s(v_I,v_{I^t})}{\tau} \biggr)}{\exp \biggl( \frac{s(v_I,v_{I^t})}{\tau} \biggr) + \sum_{I^{'} \in D_N}^{} \exp \biggl( \frac{s(v_{I^t},v_{I^{'}})}{\tau} \biggr)}<br />
\end{align}<br />
<br />
where <math>s(.,.)</math> is the cosine similarity function and <math>\tau</math> is the temperature parameter, usually set to 0.07. Also, a set of N negative images is chosen randomly from the dataset with <math>I^{'}\neq I</math>. These images are used in the loss to ensure that their representations are dissimilar to the transformed image representations. During model implementation, two heads (a few additional layers), <math>f</math> and <math>g</math>, are applied on top of <math>v(I)</math> and <math>v(I^t)</math>. Using the NCE formulation, the contrastive loss can be written as:<br />
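<br />
As a rough, hedged illustration of Equation (4) (tensor names and shapes below are assumptions, not the authors' implementation), the score <math>h</math> is a softmax of temperature-scaled cosine similarities:<br />
<pre>
# Sketch of the NCE score h(v_I, v_{I^t}) of Eq. (4); names and shapes are assumptions.
import torch
import torch.nn.functional as F

def nce_score(v_i, v_it, v_negs, temperature=0.07):
    """v_i, v_it: (D,) features of I and I^t; v_negs: (N, D) negative features."""
    pos = F.cosine_similarity(v_i, v_it, dim=0) / temperature
    neg = F.cosine_similarity(v_it.unsqueeze(0).expand_as(v_negs), v_negs, dim=1) / temperature
    sims = torch.cat([pos.view(1), neg])
    return torch.softmax(sims, dim=0)[0]   # exp(pos) / (exp(pos) + sum of exp(neg))

v_i, v_it = torch.randn(128), torch.randn(128)
v_negs = torch.randn(32, 128)              # N = 32 negatives, for illustration
print(nce_score(v_i, v_it, v_negs))
</pre>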
<br />
\begin{align} \tag{5} \label{eqn:5}<br />
L_{\text{NCE}}(I,I^{t})=-\text{log}[h(f(v_I),g(v_{I^t}))]-\sum_{I^{'}\in D_N}^{} \text{log}[1-h(g(v_{I^t}),f(v_{I^{'}}))]<br />
\end{align}<br />
<br />
[[File: SSL_4.JPG | 800px | center]]<br />
<div align="center">'''Figure 4:''' Proposed PIRL </div><br />
<br />
Although the formulation looks complicated, the takeaway here is that by minimizing the NCE-based loss function, the similarity between the original and transformed image representations, <math>v(I)</math> and <math>v(I^t)</math>, increases, and at the same time the dissimilarity between <math>v(I^t)</math> and the negative image representations, <math>v(I^{'})</math>, is increased. During training, a memory bank [], <math>m_I</math>, of dataset image representations is used to access the representations of the dataset images, including the negative images. The proposed PIRL model is shown in Figure (4). Finally, the contrastive loss in equation (5) does not take into account the dissimilarity between the original image representations, <math>v(I)</math>, and the negative image representations, <math>v(I^{'})</math>. By taking this into account and using the memory bank, the final contrastive loss function is obtained as:<br />
<br />
\begin{align} \tag{6} \label{eqn:6}<br />
L(I,I^{t})=\lambda L_{\text{NCE}}(m_I,g(v_{I^t})) + (1-\lambda)L_{\text{NCE}}(m_I,f(v_{I}))<br />
\end{align}<br />
Where <math>\lambda</math> is a hyperparameter that determines the weight of each of the NCE losses. The default value for this parameter is 0.5. In the next section, experimental results are shown using the proposed PIRL model.<br />
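<br />
Putting Equations (4)–(6) together, a minimal sketch of the overall objective is given below. It assumes that the projected features <math>f(v_I)</math> and <math>g(v_{I^t})</math>, the memory-bank entry <math>m_I</math>, and the negative features are already computed, and it reuses the nce_score() helper from the previous sketch; it is an interpretation of the formulas, not the authors' code:<br />
<pre>
# Sketch of the PIRL objective, Eqs. (5)-(6); reuses nce_score() from the earlier sketch.
import torch

def l_nce(a, b, negs, temperature=0.07):
    """Eq. (5): -log h(a, b) - sum over negatives of log(1 - h(b, negative))."""
    loss = -torch.log(nce_score(a, b, negs, temperature))
    for n in negs:
        loss = loss - torch.log(1.0 - nce_score(b, n, negs, temperature))
    return loss

def pirl_loss(m_i, f_v_i, g_v_it, negs, lam=0.5):
    """Eq. (6): lambda-weighted combination of the two NCE losses."""
    return lam * l_nce(m_i, g_v_it, negs) + (1 - lam) * l_nce(m_i, f_v_i, negs)

m_i, f_v_i, g_v_it = torch.randn(128), torch.randn(128), torch.randn(128)
negs = torch.randn(32, 128)                 # memory-bank entries of negative images
print(pirl_loss(m_i, f_v_i, g_v_it, negs))
</pre>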
<br />
==Experimental Results ==<br />
<br />
For the experiments in this section, PIRL is implemented using jigsaw transformations. The combination of PIRL with other types of transformations is shown in the last section of the summary. The quality of the image representations obtained from PIRL Self-Supervised Learning is evaluated by comparing its performance to other Self-Supervised Learning methods on image recognition and object detection tasks. For the experiments, a ResNet-50 model is trained using PIRL and other methods on 1.28M randomly sampled images from the ImageNet dataset. Also, the number of negative images used for PIRL is N=32000. <br />
<br />
===Object Detection===<br />
<br />
For object detection, a Faster R-CNN[] model is used with a ResNet-50 backbone which is pre-trained using PIRL and other Self-Supervised methods. Then, the pre-trained model weights are used as initial weights for the Faster R-CNN backbone during training on the VOC07+12 dataset. The object detection results of PIRL are shown in Figure (5) and compared to other methods. It can be seen that PIRL not only outperforms other Self-Supervised methods, but '''for the first time it also outperforms Supervised Pretraining on object detection'''. <br />
<br />
[[File: SSL_5.PNG | 800px | center]]<br />
<div align="center">'''Figure 5:''' Object detection on VOC07+12 using Faster R-CNN and comparing the Average Precision (AP) of detected bounding boxes. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
===Image Classification with linear models===<br />
<br />
In the next experiment, the performance of PIRL is evaluated on image classification using four different datasets. For this experiment, the pretrained ResNet-50 model is frozen and used as an image feature extractor, and a linear classifier is trained on the fixed image representations. The results are shown in Figure (6). They show that while PIRL substantially outperforms other Self-Supervised Learning methods, it still falls behind supervised pretraining. <br />
<br />
[[File: SSL_6.PNG | 800px | center]]<br />
<div align="center">'''Figure 6:''' Image classification with linear models. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
Overall, the results show that PIRL performs best among the different Self-Supervised Learning methods. It is even able to perform better than the supervised pretrained model on object detection. This is because PIRL learns representations that are invariant to the applied transformations, which results in more semantically meaningful and richer visual features. In the next section, some analysis of PIRL is presented.<br />
<br />
==Analysis==<br />
<br />
===Does PIRL learn invariant representations?===<br />
<br />
In order to show that the image representations obtained using PIRL are invariant, several images are chosen from the ImageNet dataset and representations of the chosen images and their transformed versions are obtained once using PIRL and once using the jigsaw pretext task, which is the transformation covariant counterpart of PIRL. Then, for each method, the L2 distances between the original and transformed image representations are computed and their distributions are plotted in Figure (7). It can be seen that PIRL results in more similarity between the original and transformed image representations. Therefore, PIRL learns invariant representations. <br />
<br />
[[File: SSL_7.PNG | 800px | center]]<br />
<div align="center">'''Figure 7:''' Invariance of PIRL representations. </div><br />
<br />
===Which layer produces the best representation?===<br />
Figure 12 studies the quality of representations in earlier layers of the convolutional networks. The figure reveals that the quality of Jigsaw representations improves from the conv1 to the res4 layer but that their quality sharply decreases in the res5 layer. By contrast, PIRL representations are invariant to image transformations and the best image representations are extracted from the res5 layer of PIRL-trained networks.<br />
<br />
[[File: Paper29_SSL.PNG | 400px | center]]<br />
<div align="center">'''Figure 12:'''Quality of PIRL representations per layer. </div><br />
<br />
===What is the effect of <math>\lambda</math> in the PIRL loss function?===<br />
<br />
In order to investigate the effect of <math>\lambda</math> on PIRL representations, the authors obtained the accuracy of image recognition on ImageNet dataset using different values for <math>\lambda</math> in PIRL. As shown in Figure 8, the results show that the value of <math>\lambda</math> affects the performance of PIRL and the optimum value for <math>\lambda</math> is 0.5. <br />
<br />
[[File: SSL_8.PNG | 800px | center]]<br />
<div align="center">'''Figure 8:''' Effect of varying the parameter <math>\lambda</math> </div><br />
<br />
===What is the effect of the number of image transforms?===<br />
<br />
As another experiment, the authors investigated the number of image transforms and their effect on PIRL performance. There is a limitation on the number of transformations that can be applied with the jigsaw pretext method, as this method has to predict the permutation of the patches, and the number of parameters in the classification layer grows linearly with the number of used transformations. PIRL, however, is able to use all possible patch permutations, of which there are <math>9! \approx 3.6\times 10^5</math>. Figure (9) shows the effect of changing the number of patch permutations on PIRL and jigsaw. The results show that increasing the number of permutations increases the mean Average Precision (mAP) of PIRL on image classification using the VOC07 dataset. <br />
<br />
[[File: SSL_9.PNG | 800px | center]]<br />
<div align="center">'''Figure 9:''' Effect of varying the number of patch permutations </div><br />
<br />
===What is the effect of the number of negative samples?===<br />
<br />
In order to investigate the effect of the number of negative samples, N, on PIRL's performance, the image classification accuracy is obtained on the ImageNet dataset for a variety of values of N. As shown in Figure (10), increasing the number of negative samples results in richer image representations and higher classification accuracy. <br />
<br />
[[File: SSL_10.PNG | 800px | center]]<br />
<div align="center">'''Figure 10:''' Effect of varying the number of negative samples </div><br />
<br />
==Generalizing PIRL to Other Pretext Tasks==<br />
<br />
The PIRL model used in this paper applied jigsaw permutations as the transformation of the original image. However, PIRL generalizes to other Pretext Tasks. To show this, PIRL is first used with rotation transformations, and the performance of rotation-based PIRL is compared to the covariant rotation Pretext Task. The results in Figure (11) show that using PIRL substantially increases the classification accuracy on four datasets in comparison with the rotation Pretext Task. Next, both jigsaw and rotation transformations are used with PIRL to obtain image representations. The results show that combining multiple transformations with PIRL can further improve the accuracy of the image classification task. <br />
<br />
[[File: SSL_11.PNG | 800px | center]]<br />
<div align="center">'''Figure 11:''' Using PIRL with (combinations of) different pretext tasks </div><br />
<br />
==Conclusion==<br />
<br />
In this paper, a new state-of-the-art Self-Supervised learning method, PIRL, was presented. The proposed model learns to obtain features that are common between the original and transformed images, resulting in a set of transformation invariant and more semantically meaningful features. This is done by defining a contrastive loss function between the original images, transformed images and a set of negative images. The results show that PIRL image representation is richer than previously proposed methods, resulting in higher accuracy and precision on image classification and object detection tasks.<br />
<br />
==Critiques==<br />
<br />
The paper proposes a very nice method for obtaining transformation invariant image representations. However, the authors can extend their work with a richer set of transformations. Also, it would be a good idea to investigate the combination of PIRL with clustering-based methods [7,8]. That may result in better image representations.<br />
<br />
It would also be helpful to visualize the network weights of the deeper layers, which extract high-level information, and compare them with those of supervised methods.<br />
<br />
== Source Code ==<br />
<br />
https://paperswithcode.com/paper/self-supervised-learning-of-pretext-invariant<br />
<br />
== References ==<br />
<br />
[1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.<br />
<br />
[2] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2015. <br />
<br />
[3] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint, 2017<br />
<br />
[4] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[5] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.<br />
<br />
[6] Jong-Chyi Su, Subhransu Maji, Bharath Hariharan. When does self-supervision improve few-shot learning? European Conference on Computer Vision, 2020.<br />
<br />
[7] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.<br />
<br />
[8] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_Representations&diff=48459Self-Supervised Learning of Pretext-Invariant Representations2020-11-30T16:44:36Z<p>Dmaleki: </p>
<hr />
<div>==Authors==<br />
<br />
Ishan Misra, Laurens van der Maaten<br />
<br />
== Presented by == <br />
Sina Farsangi<br />
<br />
== Introduction == <br />
<br />
Modern image recognition and object detection systems find image representations using large amounts of data with pre-defined semantic annotations. Some examples of these annotations are class labels [1] and bounding boxes [2], as shown in Figure 1. Finding representations with pre-defined semantic annotations requires a large amount of labeled data, which is not available in all scenarios. Also, these systems usually learn features specific to a particular set of classes, not necessarily semantically meaningful features that can help generalize to other domains and classes. '''In other words, pre-defined semantic annotations scale poorly to the long tail of visual concepts''' [3]. Therefore, there has been great interest in the community in finding image representations that are more visually meaningful and can help in several tasks such as image recognition and object detection. One of the fast-growing areas of research that tries to address this problem is '''Self-Supervised Learning'''. Self-Supervised Learning tries to learn deep models that find image representations from the pixels themselves rather than from pre-defined semantic annotations. As we will show, there is no need for class labels or bounding boxes in self-supervised learning. <br />
<br />
[[File: SSL_1.JPG | 800px | center]]<br />
<div align="center">'''Figure 1:''' Semantic Annotations used for finding image representations: a) Class labels and b) Bounding Boxes </div><br />
<br />
Self-Supervised Learning is often done using a set of tasks called '''Pretext tasks'''. During these tasks, a transformation <math> \tau </math> is applied to unlabeled images <math> I </math> to obtain a set of transformed images, <math> I^{t} </math>. Then, a deep neural network, <math> \phi(\theta) </math>, is trained to predict the transformation characteristics. Several Pretext Tasks exist based on the type of transformation used. Two of the most widely used pretext tasks are rotation and the jigsaw puzzle [4,5,6]. As shown in Figure 2, in the rotation task, unlabeled images are rotated by random degrees (0, 90, 180, 270) and the deep network learns to predict the rotation degree. In the jigsaw task, which is more complicated than the rotation task, unlabeled images are cropped into 9 patches and the image is then perturbed by randomly permuting the nine patches. Each permutation falls into one of the 35 classes according to a formula, and a deep network is trained to predict the class of the permutation of the patches in the perturbed image. Some other tasks include colorization, where the model tries to restore the colors of an image that has been converted to greyscale, and image reconstruction, where a square chunk of the image is deleted and the model tries to reconstruct that part. <br />
<br />
[[File: SSL_2.JPG |1000px | center]]<br />
<div align="center">'''Figure 2:''' Self-Supervised Learning using Rotation and Jigsaw Pretext Tasks </div><br />
<br />
Although the proposed Pretext Tasks have obtained promising results, they have the disadvantage of being covariant to the applied transformation. In other words, as deep networks are trained to predict transformation characteristics, they will also learn representations that vary with the applied transformation. Intuitively, we would like to obtain representations that are common between the original images and the transformed ones. This idea is supported by the fact that humans are able to recognize these transformed images. This suggests developing a method that obtains image representations shared between the original and transformed images, in other words, image representations that are transformation invariant. The summarized paper addresses this problem by introducing '''Pretext Invariant Representation Learning''' (PIRL), which learns Self-Supervised image representations that, as opposed to those from Pretext Tasks, are transformation invariant and therefore more semantically meaningful. The performance of the proposed method is evaluated on several Self-Supervised Learning benchmarks. The results show that PIRL sets a new state of the art in Self-Supervised Learning by learning transformation invariant representations.<br />
<br />
== Problem Formulation and Methodology ==<br />
<br />
[[File: SSL_3.JPG | 800px | center]]<br />
<div align="center">'''Figure 3:''' Figure 3: Overview of Standard Pretext Learning and Pretext-Invariant Representation Learning (PIRL). </div><br />
<br />
<br />
An overview of the proposed method and a comparison with Pretext Tasks are shown in Figure 3. For a given image, <math>I</math>, in the dataset of unlabeled images, <math> D=\{{I_1,I_2,...,I_{|D|}}\} </math>, a transformation <math> \tau </math> is applied: <br />
<br />
\begin{align} \tag{1} \label{eqn:1}<br />
I^t=\tau(I)<br />
\end{align}<br />
<br />
Where <math>I^t</math> is the transformed image. We would like to train a convolutional neural network, <math>\phi(\theta)</math>, that constructs image representations <math>v_{I}=\phi_{\theta}(I)</math>. Pretext Task based methods learn to predict transformation characteristics, <math>z(t)</math>, by minimizing a transformation covariant loss function in the form of:<br />
<br />
\begin{align} \tag{2} \label{eqn:2}<br />
l_{\text{cov}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,z(t))<br />
\end{align}<br />
<br />
As can be seen, the loss function covaries with the applied transformation and therefore the obtained representations may not be semantically meaningful. PIRL tries to solve this problem as shown in Figure 3. The original and transformed images are passed through two parallel convolutional neural networks to obtain two sets of representations, <math>v(I)</math> and <math>v(I^t)</math>. Then, a contrastive loss function is defined to ensure that the representations of the original and transformed images are similar to each other. The transformation invariant loss function can be defined as:<br />
<br />
\begin{align} \tag{3} \label{eqn:3}<br />
l_{\text{inv}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,v_{I^t})<br />
\end{align}<br />
<br />
Where L is a contrastive loss based on Noise Contrastive Estimators (NCE). The NCE function can be shown as below: <br />
<br />
\begin{align} \tag{4} \label{eqn:4}<br />
h(v_I,v_{I^t})=\frac{\exp \biggl( \frac{s(v_I,v_{I^t})}{\tau} \biggr)}{\exp \biggl( \frac{s(v_I,v_{I^t})}{\tau} \biggr) + \sum_{I^{'} \in D_N}^{} \exp \biggl( \frac{s(v_{I^t},v_{I^{'}})}{\tau} \biggr)}<br />
\end{align}<br />
<br />
where <math>s(.,.)</math> is the cosine similarity function and <math>\tau</math> is the temperature parameter, usually set to 0.07. Also, a set of N negative images is chosen randomly from the dataset with <math>I^{'}\neq I</math>. These images are used in the loss to ensure that their representations are dissimilar to the transformed image representations. During model implementation, two heads (a few additional layers), <math>f</math> and <math>g</math>, are applied on top of <math>v(I)</math> and <math>v(I^t)</math>. Using the NCE formulation, the contrastive loss can be written as:<br />
<br />
\begin{align} \tag{5} \label{eqn:5}<br />
L_{\text{NCE}}(I,I^{t})=-\text{log}[h(f(v_I),g(v_{I^t}))]-\sum_{I^{'}\in D_N}^{} \text{log}[1-h(g(v_{I^t}),f(v_{I^{'}}))]<br />
\end{align}<br />
<br />
[[File: SSL_4.JPG | 800px | center]]<br />
<div align="center">'''Figure 4:''' Proposed PIRL </div><br />
<br />
Although the formulation looks complicated, the takeaway here is that by minimizing the NCE-based loss function, the similarity between the original and transformed image representations, <math>v(I)</math> and <math>v(I^t)</math>, increases, and at the same time the dissimilarity between <math>v(I^t)</math> and the negative image representations, <math>v(I^{'})</math>, is increased. During training, a memory bank [], <math>m_I</math>, of dataset image representations is used to access the representations of the dataset images, including the negative images. The proposed PIRL model is shown in Figure (4). Finally, the contrastive loss in equation (5) does not take into account the dissimilarity between the original image representations, <math>v(I)</math>, and the negative image representations, <math>v(I^{'})</math>. By taking this into account and using the memory bank, the final contrastive loss function is obtained as:<br />
<br />
\begin{align} \tag{6} \label{eqn:6}<br />
L(I,I^{t})=\lambda L_{\text{NCE}}(m_I,g(v_{I^t})) + (1-\lambda)L_{\text{NCE}}(m_I,f(v_{I}))<br />
\end{align}<br />
Where <math>\lambda</math> is a hyperparameter that determines the weight of each of the NCE losses. The default value for this parameter is 0.5. In the next section, experimental results are shown using the proposed PIRL model.<br />
<br />
==Experimental Results ==<br />
<br />
For the experiments in this section, PIRL is implemented using jigsaw transformations. The combination of PIRL with other types of transformations is shown in the last section of the summary. The quality of the image representations obtained from PIRL Self-Supervised Learning is evaluated by comparing its performance to other Self-Supervised Learning methods on image recognition and object detection tasks. For the experiments, a ResNet-50 model is trained using PIRL and other methods on 1.28M randomly sampled images from the ImageNet dataset. Also, the number of negative images used for PIRL is N=32000. <br />
<br />
===Object Detection===<br />
<br />
For object detection, a Faster R-CNN[] model is used with a ResNet-50 backbone which is pre-trained using PIRL and other Self-Supervised methods. Then, the pre-trained model weights are used as initial weights for the Faster R-CNN backbone during training on the VOC07+12 dataset. The object detection results of PIRL are shown in Figure (5) and compared to other methods. It can be seen that PIRL not only outperforms other Self-Supervised methods, but '''for the first time it also outperforms Supervised Pretraining on object detection'''. <br />
<br />
[[File: SSL_5.PNG | 800px | center]]<br />
<div align="center">'''Figure 5:''' Object detection on VOC07+12 using Faster R-CNN and comparing the Average Precision (AP) of detected bounding boxes. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
===Image Classification with linear models===<br />
<br />
In the next experiment, the performance of PIRL is evaluated on image classification using four different datasets. For this experiment, the pretrained ResNet-50 model is frozen and used as an image feature extractor, and a linear classifier is trained on the fixed image representations. The results are shown in Figure (6). They show that while PIRL substantially outperforms other Self-Supervised Learning methods, it still falls behind supervised pretraining. <br />
<br />
[[File: SSL_6.PNG | 800px | center]]<br />
<div align="center">'''Figure 6:''' Image classification with linear models. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
Overall, the results show that PIRL performs best among the different Self-Supervised Learning methods. It is even able to perform better than the supervised pretrained model on object detection. This is because PIRL learns representations that are invariant to the applied transformations, which results in more semantically meaningful and richer visual features. In the next section, some analysis of PIRL is presented.<br />
<br />
==Analysis==<br />
<br />
===Does PIRL learn invariant representations?===<br />
<br />
In order to show that the image representations obtained using PIRL are invariant, several images are chosen from the ImageNet dataset and representations of the chosen images and their transformed versions are obtained once using PIRL and once using the jigsaw pretext task, which is the transformation covariant counterpart of PIRL. Then, for each method, the L2 distances between the original and transformed image representations are computed and their distributions are plotted in Figure (7). It can be seen that PIRL results in more similarity between the original and transformed image representations. Therefore, PIRL learns invariant representations. <br />
<br />
[[File: SSL_7.PNG | 800px | center]]<br />
<div align="center">'''Figure 7:''' Invariance of PIRL representations. </div><br />
<br />
===What is the effect of <math>\lambda</math> in the PIRL loss function?===<br />
<br />
In order to investigate the effect of <math>\lambda</math> on PIRL representations, the authors obtained the accuracy of image recognition on ImageNet dataset using different values for <math>\lambda</math> in PIRL. As shown in Figure 8, the results show that the value of <math>\lambda</math> affects the performance of PIRL and the optimum value for <math>\lambda</math> is 0.5. <br />
<br />
[[File: SSL_8.PNG | 800px | center]]<br />
<div align="center">'''Figure 8:''' Effect of varying the parameter <math>\lambda</math> </div><br />
<br />
===What is the effect of the number of image transforms?===<br />
<br />
As another experiment, the authors investigated the number of image transforms and their effect on PIRL performance. There is a limitation on the number of transformations that can be applied with the jigsaw pretext method, as this method has to predict the permutation of the patches, and the number of parameters in the classification layer grows linearly with the number of used transformations. PIRL, however, is able to use all possible patch permutations, of which there are <math>9! \approx 3.6\times 10^5</math>. Figure (9) shows the effect of changing the number of patch permutations on PIRL and jigsaw. The results show that increasing the number of permutations increases the mean Average Precision (mAP) of PIRL on image classification using the VOC07 dataset. <br />
<br />
[[File: SSL_9.PNG | 800px | center]]<br />
<div align="center">'''Figure 9:''' Effect of varying the number of patch permutations </div><br />
<br />
===What is the effect of the number of negative samples?===<br />
<br />
In order to investigate the effect of the number of negative samples, N, on PIRL's performance, the image classification accuracy is obtained on the ImageNet dataset for a variety of values of N. As shown in Figure (10), increasing the number of negative samples results in richer image representations and higher classification accuracy. <br />
<br />
[[File: SSL_10.PNG | 800px | center]]<br />
<div align="center">'''Figure 10:''' Effect of varying the number of negative samples </div><br />
<br />
==Generalizing PIRL to Other Pretext Tasks==<br />
<br />
The PIRL model used in this paper applied jigsaw permutations as the transformation of the original image. However, PIRL generalizes to other Pretext Tasks. To show this, PIRL is first used with rotation transformations, and the performance of rotation-based PIRL is compared to the covariant rotation Pretext Task. The results in Figure (11) show that using PIRL substantially increases the classification accuracy on four datasets in comparison with the rotation Pretext Task. Next, both jigsaw and rotation transformations are used with PIRL to obtain image representations. The results show that combining multiple transformations with PIRL can further improve the accuracy of the image classification task. <br />
<br />
[[File: SSL_11.PNG | 800px | center]]<br />
<div align="center">'''Figure 11:''' Using PIRL with (combinations of) different pretext tasks </div><br />
<br />
==Conclusion==<br />
<br />
In this paper, a new state-of-the-art Self-Supervised learning method, PIRL, was presented. The proposed model learns to obtain features that are common between the original and transformed images, resulting in a set of transformation invariant and more semantically meaningful features. This is done by defining a contrastive loss function between the original images, transformed images and a set of negative images. The results show that PIRL image representation is richer than previously proposed methods, resulting in higher accuracy and precision on image classification and object detection tasks.<br />
<br />
==Critiques==<br />
<br />
The paper proposes a very nice method for obtaining transformation invariant image representations. However, the authors could extend their work with a richer set of transformations. Also, it would be a good idea to investigate the combination of PIRL with clustering-based methods [7,8], which may result in better image representations.<br />
<br />
== Source Code ==<br />
<br />
https://paperswithcode.com/paper/self-supervised-learning-of-pretext-invariant<br />
<br />
== References ==<br />
<br />
[1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.<br />
<br />
[2] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2015. <br />
<br />
[3] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint, 2017<br />
<br />
[4] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[5] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.<br />
<br />
[6] Jong-Chyi Su, Subhransu Maji, Bharath Hariharan. When does self-supervision improve few-shot learning? European Conference on Computer Vision, 2020.<br />
<br />
[7] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.<br />
<br />
[8] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=48443CRITICAL ANALYSIS OF SELF-SUPERVISION2020-11-30T15:19:27Z<p>Dmaleki: </p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluated the performance of state-of-the-art self-supervised methods for learning the weights of convolutional neural networks (CNNs) on a per-layer basis. The authors were motivated by the fact that low-level features in the first layers of networks may not require the high-level semantic information captured by manual labels. This paper also aims to figure out whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of a vast amount of unlabeled data to train CNNs and find a generalized image representation. <br />
In self-supervised learning, the unlabeled data generate their own ground truth labels through pretext tasks such as the jigsaw puzzle task [6] and rotation estimation [3]. For example, in the rotation task, we have a picture of a bird without the label "bird". We rotate the bird image by 90 degrees clockwise, and the CNN is trained to predict the rotation angle, as can be seen in the figure below.<br />
<br />
[[File:self-sup-rotation.png|700px|center]]<br />
<br />
[[File:intro.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In recent literature, several papers addressed self-supervised learning methods and learning from a single sample.<br />
<br />
A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data, while the encoder performs the inverse mapping from data back to the latent space. After training a BiGAN, the encoder has learned to produce a rich image representation. In the RotNet method [3], images are rotated and the CNN learns to predict the rotation. DeepCluster [4] alternates between k-means clustering of the features and using the cluster assignments as pseudo-labels, in order to learn stable feature representations under several image transformations.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various data augmentation methods including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image. <br />
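<br />
A possible augmentation pipeline of this kind is sketched below; it is an assumed torchvision-based example, not the authors' exact settings, and the file name is hypothetical:<br />
<pre>
# Assumed sketch: generating an "artificial dataset" from one image via augmentation.
import torch
from torchvision import transforms
from PIL import Image

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),   # additive noise
])

source = Image.open("single_image.jpg").convert("RGB")   # hypothetical file
samples = [augment(source) for _ in range(1000)]          # many samples, one image
</pre>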
<br />
With the intention of measuring the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN and are trained entirely independently of the CNN itself [5]. Note that the main purpose of CNNs is to reach a linearly discriminable representation of images. Accordingly, the linear probing technique aims to evaluate the training of each layer of a CNN and inspect how much information each layer has learned.<br />
The same experiment has been done using the CIFAR-10/100 datasets.<br />
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how well representations at each level are linearly separable.<br />
According to the results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only a single image with massive augmentation is utilized in this experiment.<br />
<br />
[[File:histo.png|500px|center]]<br />
<br />
== Source Code ==<br />
<br />
The source code for the paper can be found here: https://github.com/yukimasano/linear-probes<br />
<br />
== Conclusion ==<br />
<br />
This paper revealed that, if strong data augmentation is employed, as little as a single image is sufficient for self-supervision techniques to learn the first few layers of popular CNNs. However, even the presence of millions of images is not enough for learning the deeper layers, and supervision might still be necessary. The results confirmed that the weights of the first layers of deep networks contain limited information about natural images. Accordingly, current unsupervised learning relies largely on augmentation, and we probably do not yet use the full capacity of a million images.<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019<br />
<br />
[2] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,” arXiv preprint arXiv:1610.01644, 2016.<br />
<br />
[6] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=48440CRITICAL ANALYSIS OF SELF-SUPERVISION2020-11-30T14:17:37Z<p>Dmaleki: </p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluated the performance of state-of-the-art self-supervised methods for learning the weights of convolutional neural networks (CNNs) on a per-layer basis. This paper also aims to figure out whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of a vast amount of unlabeled data to train CNNs and find a generalized image representation. <br />
In self-supervised learning, the unlabeled data generate their own ground truth labels through pretext tasks such as the jigsaw puzzle task [6] and rotation estimation [3]. For example, in the rotation task, we have a picture of a bird without the label "bird". We rotate the bird image by 90 degrees clockwise, and the CNN is trained to predict the rotation angle, as can be seen in the figure below.<br />
<br />
[[File:self-sup-rotation.png|700px|center]]<br />
<br />
[[File:intro.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In recent literature, several papers addressed self-supervised learning methods and learning from a single sample.<br />
<br />
A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data, while the encoder performs the inverse mapping from data back to the latent space. After training a BiGAN, the encoder has learned to produce a rich image representation. In the RotNet method [3], images are rotated and the CNN learns to predict the rotation. DeepCluster [4] alternates between k-means clustering of the features and using the cluster assignments as pseudo-labels, in order to learn stable feature representations under several image transformations.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various data augmentation methods including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image. <br />
<br />
With the intention of measuring the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN and are trained entirely independently of the CNN itself [5]. Note that the main purpose of CNNs is to reach a linearly discriminable representation of images. Accordingly, the linear probing technique aims to evaluate the training of each layer of a CNN and inspect how much information each layer has learned.<br />
The same experiment has been done using the CIFAR-10/100 datasets.<br />
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how well representations at each level are linearly separable.<br />
According to the results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only a single image with massive augmentation is utilized in this experiment.<br />
<br />
[[File:histo.png|500px|center]]<br />
<br />
== Source Code ==<br />
<br />
The source code for the paper can be found here: https://github.com/yukimasano/linear-probes<br />
<br />
== Conclusion ==<br />
<br />
This paper revealed that, if strong data augmentation is employed, as little as a single image is sufficient for self-supervision techniques to learn the first few layers of popular CNNs. However, even the presence of millions of images is not enough for learning the deeper layers, and supervision might still be necessary. The results confirmed that the weights of the first layers of deep networks contain limited information about natural images. Accordingly, current unsupervised learning relies largely on augmentation, and we probably do not yet use the full capacity of a million images.<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019<br />
<br />
[2] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,” arXiv preprint arXiv:1610.01644, 2016.<br />
<br />
[6] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=48439CRITICAL ANALYSIS OF SELF-SUPERVISION2020-11-30T14:15:33Z<p>Dmaleki: </p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluated the performance of state-of-the-art self-supervised methods for learning the weights of convolutional neural networks (CNNs) on a per-layer basis. This paper also aims to figure out whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of a vast amount of unlabeled data to train CNNs and find a generalized image representation. <br />
In self-supervised learning, the unlabeled data generate their own ground truth labels through pretext tasks such as the jigsaw puzzle task [6] and rotation estimation [3]. For example, in the rotation task, we have a picture of a bird without the label "bird". We rotate the bird image by 90 degrees clockwise, and the CNN is trained to predict the rotation angle, as can be seen in the figure below.<br />
<br />
[[File:self-sup-rotation.png|700px|center]]<br />
<br />
[[File:intro.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In recent literature, several papers addressed self-supervised learning methods and learning from a single sample.<br />
<br />
A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data, while the encoder performs the inverse mapping from data back to the latent space. After training a BiGAN, the encoder has learned to produce a rich image representation. In the RotNet method [3], images are rotated and the CNN learns to predict the rotation. DeepCluster [4] alternates between k-means clustering of the features and using the cluster assignments as pseudo-labels, in order to learn stable feature representations under several image transformations.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various methods of data augmentation including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image. <br />
<br />
With the intention of measuring the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN and are trained entirely independently of the CNN itself [5]. Note that the main purpose of CNNs is to reach a linearly discriminable representation of images. Accordingly, the linear probing technique aims to evaluate the training of each layer of a CNN and inspect how much information each layer has learned.<br />
The same experiment has been done using the CIFAR-10/100 datasets.<br />
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how well representations at each level are linearly separable.<br />
According to the results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only a single image with massive augmentation is utilized in this experiment.<br />
<br />
[[File:histo.png|500px|center]]<br />
<br />
== Source Code ==<br />
<br />
The source code for the paper can be found here: https://github.com/yukimasano/linear-probes<br />
<br />
== Conclusion ==<br />
<br />
This paper revealed that, if strong data augmentation is employed, as little as a single image is sufficient for self-supervision techniques to learn the first few layers of popular CNNs. However, even the presence of millions of images is not enough for learning the deeper layers, and supervision might still be necessary. The results confirmed that the weights of the first layers of deep networks contain limited information about natural images. Accordingly, current unsupervised learning relies largely on augmentation, and we probably do not yet use the full capacity of a million images.<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019.<br />
<br />
[2] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149.<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,” arXiv preprint arXiv:1610.01644, 2016.<br />
<br />
[6] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=48081Roberta2020-11-30T01:21:44Z<p>Dmaleki: </p>
<hr />
<div>= RoBERTa: A Robustly Optimized BERT Pretraining Approach =<br />
== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP (Natural Language Processing) domain, such as ELMo [1], GPT [2], BERT [3], XLM [4], and XLNet [5], have shown significant improvements, but determining which parts of these methods contribute the most is challenging. RoBERTa is a replication study of BERT pretraining that investigates the effects of hyperparameter tuning and training set size. In summary, their work can be categorized as follows: (1) they modified some BERT design choices and training schemes, and (2) they used new datasets. These two categories of modification help improve performance on downstream tasks.<br />
<br />
== Background ==<br />
This section gives an overview of BERT, since RoBERTa uses the same architecture. In short, BERT uses the transformer architecture [6] with two training objectives: masked language modelling (MLM) and next sentence prediction (NSP). The MLM objective randomly samples some of the tokens in the input sequence and replaces them with the special token [MASK]; the model is then trained to predict these tokens based on the surrounding context. NSP is a binary classification loss for predicting whether two sentences follow each other in the original text. BERT is trained with the Adam optimizer using specific parameter settings, on several different datasets. Finally, experiments are carried out on evaluation tasks such as GLUE [7], SQuAD, and RACE, and performance is reported on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once during preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking generates a new masking pattern every time a sequence is fed to the model. The results show that dynamic masking performs slightly better than static masking.<br />
[[File:mask_result.png|400px|center]]<br />
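<br />
The difference can be sketched as follows (illustrative Python only, not the RoBERTa implementation; BERT's 80/10/10 replacement rule is omitted for brevity, and the token IDs are made up):<br />
<pre>
# Sketch of static vs. dynamic masking; token IDs and MASK_ID are illustrative.
import random

MASK_ID = 0

def mask_tokens(tokens, prob=0.15):
    """Replace roughly 15% of tokens with the mask token; return masked copy + targets."""
    masked, targets = list(tokens), []
    for i, t in enumerate(tokens):
        if random.random() < prob:
            masked[i] = MASK_ID
            targets.append((i, t))
    return masked, targets

sequence = list(range(1, 21))
static_view = mask_tokens(sequence)        # BERT: masked once during preprocessing
for epoch in range(3):
    dynamic_view = mask_tokens(sequence)   # RoBERTa: a fresh pattern every epoch
</pre>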
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they investigated was the necessity of the next sentence prediction objective. They tried different input settings to determine whether the NSP loss can be eliminated during pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5, and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the Full-Sentences setting, the NSP loss is not used. This is the same as Full-Sentences, except that a sequence doesn’t cross document boundaries, i.e. once a document is over, sentences from the next one aren’t added to the sequence. Since documents vary in length, sequences sampled near the end of a document can be shorter than 512 tokens, so the batch size is adjusted in these cases to keep the total number of tokens per step comparable.<br />
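<br />
As a rough sketch of the Full-Sentences packing described above (whitespace tokenization and the separator symbol are assumptions, not the actual preprocessing code):<br />
<pre>
# Sketch of FULL-SENTENCES packing; tokenizer and separator symbol are assumptions.
def pack_full_sentences(documents, tokenize=str.split, max_len=512, sep="[SEP]"):
    inputs, current = [], []
    for doc in documents:
        for sentence in doc + [sep]:              # sep marks the document boundary
            tokens = [sep] if sentence == sep else tokenize(sentence)
            if current and len(current) + len(tokens) > max_len:
                inputs.append(current)            # flush a full training example
                current = []
            current += tokens
    if current:
        inputs.append(current)
    return inputs

docs = [["A first sentence .", "A second one ."], ["Another document ."]]
print(pack_full_sentences(docs, max_len=8))
</pre>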
<br />
The following table shows each setting's performance on the downstream tasks; the best results are achieved by the DOC-SENTENCES setting with the NSP loss removed. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they investigated was the importance of large batch sizes. They tried several batch sizes and found that a batch size of 2K performs best. The table below shows the results for the different batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
RoBERTa uses byte-level Byte-Pair Encoding (BPE) for tokenization, whereas BERT uses character-level BPE.<br />
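<br />
For illustration only, and assuming the Hugging Face "transformers" package is installed, the released byte-level BPE tokenizer can be inspected as follows; because it operates on UTF-8 bytes, arbitrary text can be encoded without unknown tokens:<br />
<pre>
# Usage sketch (assumes the Hugging Face "transformers" package is available).
from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")
print(tok.tokenize("Byte-level BPE handles café, emoji and code alike."))
print(tok.encode("Byte-level BPE handles café, emoji and code alike."))
</pre>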
<br />
== RoBERTa ==<br />
They claim that if all of these modifications are applied to BERT and the model is pre-trained on a larger dataset, higher performance can be achieved on downstream tasks. They used several datasets for pre-training, listed below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
RoBERTa outperforms the state of the art on almost all GLUE tasks, including ensemble models. In addition, they compare the performance of RoBERTa with other methods on the RACE and SQuAD evaluations and report the results in the tables below.<br />
[[File:squad.JPG|400px|center]]<br />
[[File:race.JPG|400px|center]]<br />
<br />
== Conclusion ==<br />
In conclusion, the authors argue that the sources of the gains reported by more recent methods may be questionable: if BERT is simply pre-trained in a better way, it can reach the same performance that RoBERTa achieves.<br />
<br />
== References == <br />
[1] Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.In North American Association for Computational Linguistics (NAACL).<br />
<br />
[2] Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.<br />
<br />
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).<br />
<br />
[4] Guillaume Lample and Alexis Conneau. 2019. Cross lingual language model pretraining. arXiv preprint arXiv:1901.07291.<br />
<br />
[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.<br />
<br />
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.<br />
<br />
[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=48079Roberta2020-11-30T01:19:32Z<p>Dmaleki: </p>
<hr />
<div>= RoBERTa: A Robustly Optimized BERT Pretraining Approach =<br />
== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP (Natural Language Processing) domain, such as ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5], have shown significant improvements, but determining which parts of these methods contribute the most is challenging. RoBERTa is a replication study of BERT pretraining that investigates the effects of hyperparameter tuning and training set size. In summary, their work can be categorized as follows: (1) they modified some BERT design choices and training schemes, and (2) they used a new set of datasets. These two categories of modifications help them to improve performance on downstream tasks.<br />
<br />
== Background ==<br />
This section gives an overview of BERT, since RoBERTa uses the same architecture. In short, BERT uses the transformer architecture[6] with two training objectives: masked language modelling (MLM) and next sentence prediction (NSP). The MLM objective randomly samples some of the tokens in the input sequence and replaces them with the special token [MASK]; the model is then trained to predict these tokens based on the surrounding context. NSP is a binary classification loss that predicts whether two sentences follow each other in the original text. The network is trained with the Adam optimizer using specific hyperparameters, on several different datasets. Finally, the model is evaluated on downstream tasks such as GLUE[7], SQuAD, and RACE.<br />
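<br />
To make the MLM objective concrete, the following self-contained Python sketch (an illustration, not code from the paper) applies a BERT-style corruption to a token sequence: roughly 15% of the positions become prediction targets, most of which are replaced by [MASK], some by a random token, and some left unchanged; the 15%/80%/10%/10% rates follow the recipe described in the BERT paper.<br />
<pre>
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mlm_prob=0.15, seed=0):
    """BERT-style MLM corruption: select ~mlm_prob of positions as prediction targets;
    replace 80% of them with [MASK], 10% with a random token, and keep 10% unchanged."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mlm_prob:
            labels[i] = tok                       # the model is trained to recover this token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # random replacement
            # else: leave the original token in place
    return corrupted, labels

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
# higher masking rate here only so the toy example visibly masks something
print(mask_tokens(tokens, vocab=tokens, mlm_prob=0.3, seed=2))
</pre>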
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
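<br />
The contrast can be written in a few lines of Python (a simplified sketch, not the authors' implementation): static masking fixes the corruption pattern once during preprocessing and reuses it at every training step, while dynamic masking samples a fresh pattern every time the sequence is fed to the model.<br />
<pre>
import random

def random_mask(tokens, rng, prob=0.15):
    # simplified corruption: replace roughly `prob` of the tokens with [MASK]
    return [("[MASK]" if rng.random() < prob else t) for t in tokens]

sequence = ["my", "dog", "is", "hairy", "and", "very", "friendly"]

# Static masking (original BERT): the pattern is computed once in preprocessing
# and the same corrupted sequence is reused at every training step.
static = random_mask(sequence, random.Random(1), prob=0.4)  # higher rate only so the toy example is visible
for epoch in range(3):
    print("static :", static)

# Dynamic masking (RoBERTa): a fresh pattern is sampled each time the sequence is seen.
rng = random.Random(1)
for epoch in range(3):
    print("dynamic:", random_mask(sequence, rng, prob=0.4))
</pre>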
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they investigated was the necessity of the next sentence prediction objective. They tried different input settings to examine whether the NSP loss can be eliminated in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences), where the second segment comes either from the same document or, with probability 0.5, from a different document at random, and the model is trained with the NSP objective on these pairs. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the original BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the Full-Sentences setting, they didn't use the NSP loss here either. This is the same as Full-Sentences, except that the sequence doesn't cross document boundaries, i.e. once the document is over, sentences from the next one aren't added to the sequence. Since document lengths vary, they used some form of padding to make all of the inputs the same length (a toy sketch of this packing scheme is given below).<br />
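<br />
Below is a toy sketch of the sentence-packing idea behind the Full-Sentences and Doc-Sentences settings (an illustration, not the authors' code): inputs are filled with complete sentences up to a maximum length, and in the Full-Sentences case the packing crosses document boundaries, with a separator token marking each boundary.<br />
<pre>
def pack_full_sentences(documents, max_len=512, sep="[SEP]"):
    """Full-Sentences-style packing (sketch): fill each input with complete sentences,
    crossing document boundaries; a separator token marks each boundary.
    Doc-Sentences would instead flush the current input at every document boundary."""
    sequences, current = [], []
    for d, doc in enumerate(documents):              # each document is a list of tokenized sentences
        for sent in doc:
            if current and len(current) + len(sent) > max_len:
                sequences.append(current)            # flush a full input and start a new one
                current = []
            current.extend(sent)
        if d < len(documents) - 1:
            current.append(sep)                      # mark the document boundary
    if current:
        sequences.append(current)
    return sequences

docs = [[["doc1", "sent1"], ["doc1", "sent2"]], [["doc2", "sent1"]]]
print(pack_full_sentences(docs, max_len=6))
</pre>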
<br />
In the following table, you can see each setting's performance on each downstream task; the best results are achieved in the DOC-SENTENCES setting, which removes the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they investigated was the importance of a large batch size. They tried several different batch sizes and found that a batch size of 2K gives the best performance. The table below shows their results for the different batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to BERT and pre-train the model on a larger dataset, they can achieve higher performance on downstream tasks. They used several different datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
It is clear that RoBERTa has outperformed the state of the art in almost all of the GLUE tasks, including against ensemble models. In addition, they compare the performance of RoBERTa with other methods on the RACE and SQuAD evaluations and show their results in the tables below.<br />
[[File:squad.JPG|400px|center]]<br />
[[File:race.JPG|400px|center]]<br />
<br />
== Conclusion ==<br />
In conclusion, they argue that the sources of the reported gains may be questionable: if you simply pre-train BERT in a better way, you can reach the same performance that RoBERTa achieves.<br />
<br />
==References == <br />
[1] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Association for Computational Linguistics (NAACL).<br />
<br />
[2] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.<br />
<br />
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).<br />
<br />
[4] Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.<br />
<br />
[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.<br />
<br />
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.<br />
<br />
[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=48076stat940F212020-11-30T01:18:22Z<p>Dmaleki: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Processing Method to Improve Robustness and Uncertainty || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || [https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration Summary] ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization Summary]|| [https://youtu.be/b9MU5cc3-m0 Presentation Video]<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A Fair Comparison of Graph Neural Networks for Graph Classification || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification Summary] || [https://drive.google.com/file/d/1Dx6mFL_zBAJcfPQdOWAuPn0_HkvTL_0z/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates Summary] || [[https://drive.google.com/file/d/1amkWrR8ZQKnnInjedRZ7jbXTqCA8Hy1r/view?usp=sharing Presentation ]]<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS Summary] || [https://drive.google.com/file/d/1mZVlF2UvJ2lGjuVcN5SYqBuO4jZjuCcU/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Object_Detection_with_Co-Attention_and_Co-Excitation Summary] || [https://drive.google.com/file/d/1OUx64_pTZzCQAdo_fmy_9h9NbuccTnn6/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE Summary] || [[https://youtu.be/5h-365TPQqE Presentation ]]<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations Summary] || Learn<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_Learning Summary] || [https://www.youtube.com/watch?v=WKUj30tgHfs&feature=youtu.be video]<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Generalization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features Summary]|| [https://youtu.be/djrJG6pJaL0 video] also available on Learn<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION Summary] || Learn<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_Classification Summary] || [https://www.youtube.com/watch?v=jG57QgY71yU video]<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Dense Passage Retrieval for Open-Domain Question Answering || [https://arxiv.org/abs/2004.04906 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering Summary] || Learn<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes Summary]|| Learn<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| AdaCompress: Adaptive Compression for Online Computer Vision Services || [https://arxiv.org/pdf/1909.08148.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adacompress:_Adaptive_compression_for_online_computer_vision_services Summary] || [https://youtu.be/D54qsSkqryk video] or Learn<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||RoBERTa: A Robustly Optimized BERT Pretraining Approach ||[https://openreview.net/forum?id=SyxS0T4tvS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta Summary] || [https://youtu.be/JdfvvYbH-2s Presentation Video]<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT Summary] || Learn<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||A critical analysis of self-supervision, or what we can learn from a single image|| [https://openreview.net/pdf?id=B1esx6EYvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION Summary]||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| Self-Supervised Learning of Pretext-Invariant Representations || [https://openaccess.thecvf.com/content_CVPR_2020/papers/Misra_Self-Supervised_Learning_of_Pretext-Invariant_Representations_CVPR_2020_paper.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_Representations Summary] ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval Summary]|| Learn<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:race.JPG&diff=48050File:race.JPG2020-11-30T01:09:31Z<p>Dmaleki: </p>
<hr />
<div>== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD, RACE and show their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5 and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
It is clear that RoBERTa has outperformed the state of the art in almost all of the GLUE tasks including for ensemble models. More than that, they compare the performance of the RoBERTa with other methods on the RACE and SQuAD evaluation and show their results in the bellow table.<br />
[[File:squad.JPG|400px|center]]<br />
[[File:race.JPG|400px|center]]<br />
<br />
== Conclusion ==<br />
In conclusion, they basically say the reasons why they make gains may be questionable and if you simply pre-trained bert in a better way you can reach the same performances<br />
<br />
==Refrences == <br />
[1] Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.In North American Association for Com- putational Linguistics (NAACL).<br />
<br />
[2] Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.<br />
<br />
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).<br />
<br />
[4] Guillaume Lample and Alexis Conneau. 2019. Crosslingual language model pretraining. arXiv preprint arXiv:1901.07291.<br />
<br />
[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.<br />
<br />
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.<br />
<br />
[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=48070Roberta2020-11-30T01:16:25Z<p>Dmaleki: </p>
<hr />
<div>== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD, RACE and show their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5 and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
It is clear that RoBERTa has outperformed the state of the art in almost all of the GLUE tasks including for ensemble models. More than that, they compare the performance of the RoBERTa with other methods on the RACE and SQuAD evaluation and show their results in the bellow table.<br />
[[File:squad.JPG|400px|center]]<br />
[[File:race.JPG|400px|center]]<br />
<br />
== Conclusion ==<br />
In conclusion, they basically say the reasons why they make gains may be questionable and if you simply pre-trained bert in a better way you can reach the same performances<br />
<br />
==Refrences == <br />
[1] Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.In North American Association for Com- putational Linguistics (NAACL).<br />
[2] Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.<br />
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).<br />
[4] Guillaume Lample and Alexis Conneau. 2019. Crosslingual language model pretraining. arXiv preprint arXiv:1901.07291.<br />
[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.<br />
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.<br />
[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).<br />
[8]</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=48055Roberta2020-11-30T01:10:41Z<p>Dmaleki: </p>
<hr />
<div>== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD[7], RACE[8] and show their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5 and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
It is clear that RoBERTa has outperformed the state of the art in almost all of the GLUE tasks including for ensemble models. More than that, they compare the performance of the RoBERTa with other methods on the RACE and SQuAD evaluation and show their results in the bellow table.<br />
[[File:squad.JPG|400px|center]]<br />
[[File:race.JPG|400px|center]]<br />
<br />
== Conclusion ==<br />
In conclusion, they basically say the reasons why they make gains may be questionable and if you simply pre-trained bert in a better way you can reach the same performances</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=48053Roberta2020-11-30T01:10:24Z<p>Dmaleki: </p>
<hr />
<div>== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD[7], RACE[8] and show their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5 and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, no NSP loss is used. Input sequences consist of full sentences from one or more documents. If one document ends, sentences from the next document are added, separated by an extra separator token, until the total length reaches at most 512 tokens.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: As in the Full-Sentences setting, the NSP loss is not used. This is the same as Full-Sentences, except that the sequence doesn't cross document boundaries, i.e. once a document ends, sentences from the next one aren't added to the sequence. Since this makes input lengths vary near document boundaries, padding is used to bring all inputs to the same length. A sketch contrasting these two NSP-free input formats is given below.<br />
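As a rough sketch of how the two NSP-free formats differ (illustrative only; the separator token name and the 512-token limit follow the descriptions above), FULL-SENTENCES keeps packing sentences across document boundaries, while DOC-SENTENCES stops at the end of the current document:<br />
<pre>
SEP, MAX_LEN = "[SEP]", 512

def pack_sequence(documents, cross_document, max_len=MAX_LEN):
    """Greedily pack sentences (lists of tokens) into one input sequence.

    cross_document=True  -> FULL-SENTENCES: continue into the next document,
                            inserting an extra separator token at the boundary.
    cross_document=False -> DOC-SENTENCES: stop at the document boundary.
    """
    packed = []
    for d, doc in enumerate(documents):
        for sentence in doc:
            if len(packed) + len(sentence) > max_len:
                return packed                 # the sequence is full
            packed.extend(sentence)
        if not cross_document:
            return packed                     # respect the document boundary
        if d < len(documents) - 1:
            packed.append(SEP)                # mark the boundary between documents
    return packed

docs = [[["a"] * 60] * 3, [["b"] * 60] * 5]   # two toy documents of 3 and 5 sentences
print(len(pack_sequence(docs, cross_document=True)))    # FULL-SENTENCES: 481 tokens
print(len(pack_sequence(docs, cross_document=False)))   # DOC-SENTENCES: 180 tokens
</pre>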
<br />
The following table shows each setting's performance on the downstream tasks; the best results are achieved in the DOC-SENTENCES setting with the NSP loss removed. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they investigate is the importance of large batch sizes. They try several batch sizes and find that a batch size of 2K performs best. The table below shows the results for the different batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
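Batch sizes in the thousands typically do not fit in memory on a single device. One common way to reach such effective batch sizes, shown below as a generic PyTorch-style sketch that is not taken from the paper, is gradient accumulation: several small forward/backward passes are accumulated before a single optimizer step.<br />
<pre>
import torch.nn.functional as F

def train_with_accumulation(model, loader, optimizer, target_batch=2048, micro_batch=32):
    """Simulate a large effective batch by accumulating gradients over micro-batches."""
    accum_steps = target_batch // micro_batch          # e.g. 2048 / 32 = 64 micro-batches per update
    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(loader):   # loader is assumed to yield micro-batches
        loss = F.cross_entropy(model(inputs), labels)
        (loss / accum_steps).backward()                # scale so the accumulated gradient matches one big batch
        if (step + 1) % accum_steps == 0:
            optimizer.step()                           # one parameter update per effective 2K batch
            optimizer.zero_grad()
</pre>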
<br />
=== Tokenization ===<br />
In RoBERTa, byte-level Byte-Pair Encoding (BPE) is used for tokenization, whereas BERT uses character-level BPE.<br />
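The practical difference shows up when rare characters appear in the input: a fixed character-level vocabulary may have to fall back to an unknown token, whereas byte-level BPE can encode any text as a sequence of byte-derived subwords. The snippet below assumes the Hugging Face transformers package and its pretrained bert-base-uncased and roberta-base tokenizers, which are not part of this summary.<br />
<pre>
# Assumes: pip install transformers (the pretrained vocabularies are downloaded on first use).
from transformers import BertTokenizer, RobertaTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")    # ~30K-entry character-level vocabulary
roberta_tok = RobertaTokenizer.from_pretrained("roberta-base")   # ~50K-entry byte-level BPE vocabulary

text = "RoBERTa handles emoji 🤗 and accents like café"
print(bert_tok.tokenize(text))     # unseen characters may map to [UNK]
print(roberta_tok.tokenize(text))  # any byte sequence decomposes into byte-level subwords
</pre>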
<br />
== RoBERTa ==<br />
They claim that applying all of these modifications to BERT and pretraining the model on a larger dataset yields higher performance on downstream tasks. They use several datasets for pretraining, listed below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: An open-source recreation of the WebText dataset used to train OpenAI GPT-2.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
It is clear that RoBERTa outperforms the state of the art on almost all of the GLUE tasks, even when compared against ensemble models. In addition, they compare the performance of RoBERTa with other methods on the RACE and SQuAD evaluations and report the results in the tables below.<br />
[[File:squad.JPG|600px|center]]<br />
[[File:race.JPG|600px|center]]<br />
<br />
== Conclusion ==<br />
In conclusion, they argue that the reasons behind the reported gains of more recent methods may be questionable, and that simply pretraining BERT in a better way can reach the same performance.</div>Dmaleki
<hr />
<div>== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD[7], RACE[8] and show their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5 and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
It is clear that RoBERTa has outperformed the state of the art in almost all of the GLUE tasks including for ensemble models. More than that, they compare the performance of the RoBERTa with other methods on the RACE and SQuAD evaluation and show their results in the bellow table.<br />
<br />
== Conclusion ==<br />
In conclusion, they basically say the reasons why they make gains may be questionable and if you simply pre-trained bert in a better way you can reach the same performances</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=48035Roberta2020-11-30T01:04:10Z<p>Dmaleki: </p>
<hr />
<div>== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD[7], RACE[8] and show their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5 and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
It is clear that RoBERTa has outperformed the state of the art in almost all of the GLUE tasks including for ensemble models.</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=47981Roberta2020-11-30T00:21:20Z<p>Dmaleki: </p>
<hr />
<div>== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD[7], RACE[8] and show their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5 and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=47980Roberta2020-11-30T00:21:01Z<p>Dmaleki: </p>
<hr />
<div>== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD[7], RACE[8] and show their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5 and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|400px|center]]</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=47978Roberta2020-11-30T00:19:58Z<p>Dmaleki: </p>
<hr />
<div>== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD[7], RACE[8] and show their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
(1) <b>Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5 and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
(2) Sentence-Pair + NSP: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
(3) Full-Sentences: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
(4) Doc-Sentences: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
(1) BookCorpus + English Wikipedia (16GB): This is the data on which BERT is trained.<br />
<br />
(2) CC-News (76GB): The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
(3) OpenWebText (38GB): Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
(4) Stories (31GB): A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|400px|center]]</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=47976Roberta2020-11-30T00:17:46Z<p>Dmaleki: </p>
<hr />
<div>== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD[7], RACE[8] and show their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
(1) Segment-Pair + NSP: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5 and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
(2) Sentence-Pair + NSP: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
(3) Full-Sentences: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
(4) Doc-Sentences: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
(1) BookCorpus + English Wikipedia (16GB): This is the data on which BERT is trained.<br />
<br />
(2) CC-News (76GB): The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
(3) OpenWebText (38GB): Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
(4) Stories (31GB): A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|400px|center]]</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=47975Roberta2020-11-30T00:17:33Z<p>Dmaleki: </p>
<hr />
<div>== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD[7], RACE[8] and show their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
(1) Segment-Pair + NSP: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5 and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
(2) Sentence-Pair + NSP: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
(3) Full-Sentences: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
(4) Doc-Sentences: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
(1) BookCorpus + English Wikipedia (16GB): This is the data on which BERT is trained.<br />
<br />
(2) CC-News (76GB): The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
(3) OpenWebText (38GB): Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
(4) Stories (31GB): A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:data.JPG|400px|center]]</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=47974Roberta2020-11-30T00:15:54Z<p>Dmaleki: </p>
<hr />
<div>== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD[7], RACE[8] and show their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
<br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
(1) Segment-Pair + NSP: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5 and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (which is the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
(2) Sentence-Pair + NSP: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
(3) Full-Sentences: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
(4) Doc-Sentences: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ====<br />
Next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization in contrast to BERT which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset they can achieve a higher performance on downstream tasks. They used different types of datasets for their pre-training that you can see a list of them below.<br />
<br />
(1) BookCorpus + English Wikipedia (16GB): This is the data on which BERT is trained.<br />
<br />
<br />
(2) CC-News (76GB): The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<br />
(3) OpenWebText (38GB): Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<br />
(4) Stories (31GB): A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:data.JPG|400px|center]]</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=47950Roberta2020-11-29T23:49:39Z<p>Dmaleki: </p>
<hr />
<div>== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD[7], RACE[8] and show their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
<br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts these tokens. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|600px|center]]<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
1-Segment-Pair + NSP: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5 and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (which is the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
2-Sentence-Pair + NSP: Same as the segment-pair representation, just with pairs of sentences. However, it is evident that the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
3-Full-Sentences: The NSP loss is removed in this setting. Input sequences consist of full sentences from one or more documents; if one document ends, sentences from the next document are appended after an extra separator token, until the sequence length reaches at most 512 tokens.<br />
<br />
<br />
4-Doc-Sentences: Similar to Full-Sentences, the NSP loss is removed here as well. This is the same as Full-Sentences, except that a sequence doesn't cross document boundaries, i.e. once a document ends, sentences from the next one aren't added to the sequence. Since sequences near the end of a document can therefore be shorter, the batch size is varied dynamically so that the number of tokens per training step is similar to Full-Sentences (a packing sketch for these two settings is shown after this list).<br />
<br />
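The packing behind the Full-Sentences and Doc-Sentences settings can be sketched roughly as follows (hypothetical helper, not the authors' code; a real implementation would also handle special tokens and tokenizer details):<br />
<pre>
SEP, MAX_LEN = "[SEP]", 512

def pack_sentences(documents, cross_document=True):
    """Greedily pack tokenized sentences into sequences of at most MAX_LEN tokens.
    cross_document=True roughly corresponds to Full-Sentences (sequences may span
    documents, with an extra separator at the boundary); False corresponds to
    Doc-Sentences (a sequence never crosses a document boundary).
    Assumes no single sentence is longer than MAX_LEN."""
    sequences, current = [], []
    for doc in documents:
        for sent in doc:                          # each sentence is a list of tokens
            if len(current) + len(sent) > MAX_LEN:
                sequences.append(current)
                current = []
            current.extend(sent)
        if cross_document:
            current.append(SEP)                   # mark the document boundary and keep packing
        elif current:
            sequences.append(current)             # flush: do not continue into the next document
            current = []
    if current:
        sequences.append(current)
    return sequences

# Example with toy "documents" of tokenized sentences.
docs = [[["a"] * 300, ["b"] * 300], [["c"] * 100]]
print([len(s) for s in pack_sentences(docs, cross_document=True)])   # [300, 402]
print([len(s) for s in pack_sentences(docs, cross_document=False)])  # [300, 300, 100]
</pre>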
In the following table you can see the performance of each setting on each downstream task. The best results are achieved with the Doc-Sentences setting, i.e. with the NSP loss removed.<br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
They then investigated the importance of large batch sizes. They tried several batch sizes and found that a batch size of 2K performs best. Their results for the different batch sizes are shown in the table below, after a short sketch of how such batch sizes can be simulated in practice.<br />
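Batches of 2K sequences rarely fit in GPU memory directly; one common way to simulate such an effective batch size (not necessarily what the authors did) is gradient accumulation. A minimal PyTorch sketch with toy stand-ins for the model and data:<br />
<pre>
import torch
from torch import nn

# Toy stand-ins; in practice this would be a BERT-style model and a text data loader.
model = nn.Linear(16, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
loader = [(torch.randn(4, 16), torch.randint(0, 2, (4,))) for _ in range(32)]

accumulation_steps = 8  # 8 micro-batches of 4 examples ≈ one effective batch of 32

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(loader):
    loss = loss_fn(model(inputs), labels) / accumulation_steps  # scale so the accumulated gradient is an average
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
</pre>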
<br />
[[File:batch_size.JPG|600px|center]]</div>Dmaleki