STAT946F20/BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2020-12-10T10:36:30Z<p>A2chanan: /* Critique */</p>
<hr />
<div>== Presented by == <br />
Wenyu Shen<br />
<br />
== Introduction == <br />
This paper introduces the BERT model. BERT stands for Bidirectional Encoder Representations from Transformers, and the model set new state-of-the-art results on eleven natural language processing tasks. BERT advanced the state of the art for pre-training of contextual representations. One novel feature, compared to Word2Vec or GloVe, is that BERT produces different representations for the same word in different contexts. To elaborate, Word2Vec always creates the same embedding for a given word regardless of the words that precede and follow it. BERT, however, generates different embeddings depending on the surrounding words. This is useful because words can be homonyms: "bank", for example, can refer to a "financial institution" or to the "land alongside or sloping down to a river or lake".<br />
<br />
== Transformer and BERT == <br />
Let us start with the encoder and the decoder. As covered in class, the encoder-decoder model is applied to sequence-to-sequence (seq2seq) problems: given an input sequence x, the encoder-decoder model generates an output sequence y based on x (for example in translation or question answering). However, when an RNN or a similar model is used as the basic architecture of the encoder-decoder, performance can degrade when the input sequence is too long. An encoder-decoder with attention avoids compressing all of the encoder outputs into a single context vector, and the paper Attention Is All You Need [1] goes further by introducing the Transformer, a framework that uses only attention in the encoder-decoder, and applies it to machine translation. The Transformer uses Scaled Dot-Product Attention, with a sequential (look-ahead) mask in the decoder, and runs several attention heads in parallel (Multi-Head Attention) so that each token can draw features from different representation subspaces of the sentence. To capture the order of the inputs, the Transformer adds a positional encoding, which has the same dimension as the word embedding, to each token embedding; the original Transformer uses fixed sinusoidal encodings, whereas BERT learns its position embeddings. BERT is built by stacking N Transformer encoder blocks. <br />
<br />
[[File:Transformer Structure.png | center |800px]]<br />
<br />
<div align="center">Table 1: Transformer Structure </div><br />
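Scaled Dot-Product Attention computes softmax(QK^T / sqrt(d_k)) V, where d_k is the key dimension. The following is a minimal sketch in plain Python for a single attention head without masking; the function names and the list-of-lists representation are illustrative assumptions and are not taken from the paper's implementation.<br />

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query.

    queries, keys: lists of vectors of dimension d_k.
    values: list of vectors, one per key.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

A query that matches no key more than any other receives the plain average of the values, while a query strongly aligned with one key receives (almost) that key's value. Multi-head attention repeats this computation with several learned projections of the queries, keys, and values.<br />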
<br />
== BERT ==<br />
BERT works well with both the feature-based and the fine-tuning approaches. Both approaches start with unsupervised pre-training on a source A. The feature-based approach then keeps the pre-trained parameters fixed and uses the labeled source B to train a task-specific model that takes the pre-trained representations as additional features, whereas the fine-tuning approach updates all parameters when training on the downstream task. The paper focuses on the fine-tuning approach.<br />
<br />
Standard language models are trained left to right. A deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. However, bidirectional conditioning would allow each word to see itself indirectly, which makes the prediction problem trivial. Therefore, BERT uses a masked language model (MLM) objective to pre-train deep bidirectional Transformers. In this pre-training method, some randomly chosen tokens are masked in each sequence, and the model's objective is to predict the vocabulary id of each masked token from both its left and its right context. Because the [MASK] token never appears during fine-tuning, always masking the selected tokens would create a mismatch between pre-training and fine-tuning. To reduce this mismatch, of the 15% of tokens selected for prediction, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged.<br />
<br />
BERT is also trained on the Next Sentence Prediction (NSP) task so that the model learns the relationship between sentences. In the NSP task, two sentences, A and B, are fed to the network, which predicts whether B follows A. In the training data, B is the actual next sentence 50% of the time (labeled IsNext) and a random sentence from the corpus 50% of the time (labeled NotNext). Finally, the input representation is built from Token Embeddings, Segment Embeddings, and Position Embeddings, which lets BERT handle a variety of downstream tasks. <br />
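The 80%/10%/10% replacement rule for the masked language model can be sketched as follows. This is a simplified illustration and not the authors' code: the function name, the per-token coin flip used to select roughly 15% of the tokens, and the choice to skip the special tokens [CLS] and [SEP] are assumptions made for the example.<br />

```python
import random

MASK = "[MASK]"
SPECIAL_TOKENS = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) for masked language modelling.

    labels[i] holds the original token at every selected position and
    None elsewhere, so the loss is computed on selected positions only.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in SPECIAL_TOKENS:
            continue  # assumption: special tokens are never selected
        if rng.random() >= select_prob:
            continue  # token not selected for prediction
        labels[i] = token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels
```

Because a selected token is sometimes left unchanged or swapped for a random one, the model cannot rely on [MASK] being present and has to keep a useful contextual representation for every input token.<br />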
[[File:Token embedding.png | center | 800px]]<br />
<br />
<div align="center">Table 2: Token embedding</div><br />
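A sketch of how a sentence pair might be packed into a single input sequence, and how an NSP training pair might be sampled, is given below. The helper names are hypothetical, and drawing the NotNext sentence from the same small list (rather than from a large corpus) is a simplification for illustration. In the model, the token, segment, and position ids each index an embedding table, and the three embeddings are summed.<br />

```python
import random


def build_input(tokens_a, tokens_b):
    """Pack sentences A and B as [CLS] A [SEP] B [SEP] with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def make_nsp_example(sentences, index, rng):
    """Return (sentence_a, sentence_b, label) for Next Sentence Prediction.

    sentences: list of tokenized sentences in document order (at least three),
    index: position of sentence A (not the last sentence).
    """
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        return sentence_a, sentences[index + 1], "IsNext"
    # Simplification: sample the negative from the same list, excluding
    # sentence A itself and the true next sentence.
    candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), "NotNext"
```

The output at the [CLS] position is what the NSP classifier (and later any sentence-level downstream classifier) reads, which is why every packed sequence starts with it.<br />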
<br />
== Applications ==<br />
<br />
As previously mentioned, BERT has achieved state-of-the-art performance on eleven NLP tasks. BERT can also be pre-trained on different corpora, as seen in Figure 1, and different pre-training and fine-tuning choices can then be applied downstream. The landscape shown is certainly not exhaustive, but it illustrates the wide range of applications for which BERT can be retrained.<br />
<br />
[[File:application_landscape.png| center |1000px|Image: 1000 pixels]]<br />
<br />
<div align="center">Figure 1: Landscape of BERT Applications</div><br />
<br />
== Comparison between ELMo, GPT, and BERT ==<br />
In this section, we will compare BERT with previous language models, particularly ELMo and GPT. These three models are among the biggest advancements in NLP. ELMo is a bi-directional LSTM model and is able to capture context information from both directions. It's a feature-based approach, which means the pre-trained representations are used as features. GPT and BERT are both transformer-based models. GPT only uses transformer decoders and is unidirectional. This means information only flows from the left to the right in GPT. In contrast, BERT only uses transformer encoders and is bidirectional. Therefore, it can capture more context information than GPT and tends to perform better when context information from both sides is important. GPT and BERT are fine-tuning-based approaches. Users can use the models on downstream tasks by simply fine-tuning model parameters.<br />
<br />
[[File:comparison_paper5.png | center |800px]]<br />
The picture above makes the comparison between these three models clearer. As mentioned above, GPT is unidirectional: the connections between layers are not dense, and each position receives information only from positions to its left. BERT is bidirectional in the sense that connections from left to right and from right to left are both present (the layers are densely connected). ELMo is also bidirectional, but not in the same way as BERT: it uses a concatenation of independently trained left-to-right and right-to-left LSTMs. Among these three models, only the BERT representations are jointly conditioned on both left and right context in all layers.<br />
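One way to make the difference between GPT and BERT concrete is through the attention mask, which states which positions each token may attend to. The sketch below builds both masks as 0/1 matrices; the function names are illustrative and not taken from either model's code.<br />

```python
def causal_mask(n):
    """Left-to-right mask (GPT style): position i may attend to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Full mask (BERT style): every position may attend to every position."""
    return [[1] * n for _ in range(n)]
```

For a three-token input, causal_mask(3) is lower triangular, so the first token sees only itself, while bidirectional_mask(3) is all ones, so every token sees its full left and right context in every layer.<br />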
<br />
== Conclusion ==<br />
<br />
In conclusion, BERT is a powerful model pre-trained on a large amount of unlabeled text, and it is particularly useful when we want to perform NLP tasks for which only a small amount of labeled data is available.<br />
<br />
<br />
[[File:Result.png | center |800px]]<br />
<br />
<div align="center">Table 3: Performance of BERT in multiple datasets</div><br />
<br />
From Table 3 it can be observed that <math>BERT_{LARGE}</math> and <math>BERT_{BASE}</math> perform significantly better than the previous state-of-the-art models, with 7% and 4.5% improvements, respectively, in average accuracy over the previous best model (OpenAI GPT). It is also noteworthy that OpenAI GPT and <math>BERT_{BASE}</math> have nearly identical architectures; the main difference is the attention masking (GPT restricts attention to the left context, whereas <math>BERT_{BASE}</math> attends in both directions), and this accounts for an improvement of 4.5%. It can also be seen that <math>BERT_{LARGE}</math> outperforms <math>BERT_{BASE}</math> across all the datasets, and that the difference is especially large when less training data is available.<br />
<br />
== Critique ==<br />
BERT showed that Transformers are a good architecture for downstream NLP tasks, but the authors did not thoroughly explore the choice of hyperparameters or of training and pre-training settings. As ALBERT [3] and RoBERTa [4] show in their papers, better hyperparameters or training choices can give similar or even better performance with less time and training data.<br />
<br />
== Repository ==<br />
<br />
A GitHub repository with an implementation of BERT is available at <span class="plainlinks">[https://github.com/brightmart/bert_language_understanding "bert_language_understanding"]</span><br />
<br />
== Fun facts ==<br />
<br />
A collection of BERT-related papers published in 2019. The y-axis is the log of the citation count (based on Google Scholar).<br />
[[File:BERT-related.gif|800px|center]]<br />
<br />
== References ==<br />
[1] Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin.<br />
"Attention Is All You Need". (2017)<br />
<br />
[2] Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". (2019)<br />
<br />
[3] Lan, Zhenzhong, et al. "Albert: A lite bert for self-supervised learning of language representations." arXiv preprint arXiv:1909.11942 (2019).<br />
[4] Liu, Yinhan, et al. "Roberta: A robustly optimized bert pretraining approach." arXiv preprint arXiv:1907.11692 (2019).</div>
DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION, 2020-12-10T10:07:57Z<p>A2chanan: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Bowen You<br />
<br />
== Introduction == <br />
<br />
Reinforcement learning (RL) is one of the three basic machine learning paradigms, alongside supervised and unsupervised learning. It refers to training an agent, often a neural network, to make a series of decisions in a complex, evolving environment. Typically, this is accomplished by 'rewarding' or 'penalizing' the agent based on its behavior over time. For recent reviews of reinforcement learning, see [3,4]. Intelligent agents are able to accomplish tasks that they may not have seen in prior experience. One way to achieve this is to represent the world based on past experience. In this paper, the authors propose an agent that learns long-horizon behaviors purely by latent imagination and outperforms previous agents in terms of data efficiency, computation time, and final performance. Along with the latent space representation, an actor-critic model is used to learn and optimize the behavior of the agent. The proposed method is model-based: it learns a latent state representation and latent dynamics via prediction, and the behavior is then trained inside this learned model. By contrast, the term "model-free" in RL refers to methods that have no explicit model of the environment and its dynamics. The authors learn a critic, or value function, directly on latent state samples, which helps to scale to more complex tasks.<br />
<br />
<br />
<br />
<br />
The main finding of the paper is that long-horizon behaviors can be learned by latent imagination. This avoids the shortsightedness that comes with using finite imagination horizons. The authors also demonstrate strong empirical performance for visual control by evaluating the model on image inputs.<br />
<br />
[[File:Figure1 paper.png|100px|center]]<br />
<br />
=== Preliminaries ===<br />
<br />
This section defines a few key concepts in reinforcement learning. In the typical reinforcement learning problem, an <b>agent</b> interacts with the <b>environment</b>. The environment is typically defined by a <b>model</b> that may or may not be known. The environment is characterized by its <b>state</b> <math display="inline"> s \in \mathcal{S}</math>. The agent may choose to take <b>actions</b> <math display="inline"> a \in \mathcal{A}</math> to interact with the environment. Once an action is taken, the environment returns a <b>reward</b> <math display="inline"> r \in \mathcal{R}</math> as feedback.<br />
<br />
The actions an agent decides to take are defined by a <b>policy</b> function <math display="inline"> \pi : \mathcal{S} \to \mathcal{A}</math>. <br />
Additionally, we define functions <math display="inline"> V_{\pi} : \mathcal{S} \to \mathbb{R}</math> and <math display="inline"> Q_{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}</math> to represent the value function and the action-value function of a given policy <math display="inline">\pi</math>, respectively. Informally, <math>V_{\pi}</math> tells one how good a state is in terms of the expected return when starting in the state <math>s</math> and then following the policy <math>\pi</math>. Similarly, <math>Q_{\pi}</math> gives the expected return when starting from the state <math>s</math>, taking the action <math>a</math>, and subsequently following the policy <math>\pi</math>. <br />
<br />
Thus the goal is to find an optimal policy <math display="inline">\pi_{*}</math> such that, for every state <math>s</math> and action <math>a</math>, <br />
\[<br />
\pi_{*} = \arg\max_{\pi} V_{\pi}(s) = \arg\max_{\pi} Q_{\pi}(s, a)<br />
\]<br />
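<br />
As a sketch of what "expected return" means here, the two functions can be written out explicitly, with <math>S_t, A_t, R_t</math> denoting the state, action, and reward at time <math>t</math> and <math>T</math> the final time step. The discount factor <math>\gamma \in [0, 1]</math> is an assumption added for illustration and does not appear in the definitions above; setting <math>\gamma = 1</math> gives the plain sum of rewards.<br />
\[<br />
V_{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=t}^{T} \gamma^{k-t} R_k \,\middle|\, S_t = s \right], \qquad Q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=t}^{T} \gamma^{k-t} R_k \,\middle|\, S_t = s, A_t = a \right]<br />
\]<br />
Since the policy maps each state to a single action, the two functions are linked by <math>V_{\pi}(s) = Q_{\pi}(s, \pi(s))</math>.<br />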
<br />
=== Feedback Loop ===<br />
<br />
Given this framework, agents interact with the environment in a sequential fashion, producing a sequence of states, actions, and rewards. Let <math display="inline"> S_t, A_t, R_t</math> denote the state, action, and reward obtained at time <math display="inline"> t = 1, 2, \ldots, T</math>. We call the tuple <math display="inline">(S_t, A_t, R_t)</math> one step (or transition), and the whole sequence up to the final time <math display="inline">T</math> one <b>episode</b>. This can be thought of as a feedback loop or a sequence<br />
\[<br />
S_1, A_1, R_1, S_2, A_2, R_2, \ldots, S_T<br />
\]<br />
<br />
[[File:rl_loop.png|350px|center]]<br />
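The feedback loop can be sketched with a toy environment. The corridor environment, its reward, and the function names below are invented for illustration and are unrelated to the tasks used in the paper.<br />

```python
class Corridor:
    """Toy environment: states 0..length-1, reward 1.0 on reaching the last state."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (move left) or +1 (move right); the walls clamp the state.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=20):
    """Run one episode and return the list of (state, action, reward) steps."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                       # agent chooses A_t from S_t
        next_state, reward, done = env.step(action)  # environment returns R_t and S_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```

With a policy that always moves right, run_episode(Corridor(), lambda s: 1) takes four steps, and only the last one carries a non-zero reward.<br />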
<br />
== Motivation ==<br />
<br />
In many problems, the number of actions an agent is able to take while learning is limited, for example due to computational resource limitations, sensitivity of the environment, or physical resource constraints. Thus, it is difficult to interact with the environment enough to learn an accurate representation of the world. The proposed method in this paper aims to solve this problem by "imagining" the states and rewards that an action will produce. That is, given a state <math display="inline">S_t</math>, the proposed method generates <br />
\[<br />
\hat{A}_t, \hat{R}_t, \hat{S}_{t+1}, \ldots<br />
\]<br />
<br />
By doing this, an agent is able to plan ahead and form a representation of the environment without interacting with it. Once an action is taken, the agent can update its representation of the world with the actual observation. This is particularly useful in applications where experience is not easily obtained. <br />
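A minimal sketch of such an imagined rollout is shown below. It assumes that a transition model and a reward model have already been learned; here they are plain Python callables standing in for neural networks, and computing the reward from the imagined next state is a simplification chosen for this example.<br />

```python
def imagine(state, policy, transition_model, reward_model, horizon):
    """Roll out `horizon` imagined steps from `state` without touching the environment.

    Returns a list of (action, reward, next_state) tuples, mirroring the
    imagined sequence of actions, rewards and next states.
    """
    trajectory = []
    for _ in range(horizon):
        action = policy(state)                       # imagined action
        next_state = transition_model(state, action)  # imagined next state
        reward = reward_model(next_state)            # imagined reward
        trajectory.append((action, reward, next_state))
        state = next_state
    return trajectory
```

For example, with transition_model = lambda s, a: s + a, reward_model = lambda s: 1.0 if s >= 3 else 0.0, and a policy that always returns 1, a three-step rollout from state 0 predicts a reward only at the last step, and no real environment step is consumed.<br />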
<br />
== Dreamer == <br />
<br />
The authors of the paper call their method Dreamer. From a high-level perspective, Dreamer first learns latent dynamics from past experience. Then it learns action and value models from imagined trajectories to maximize future rewards. Finally, it predicts the next action and executes it in the environment. This whole process is illustrated below. <br />
<br />
[[File: dreamer_overview.png | 600px | center]]<br />
<br />
<br />
Let's look at Dreamer in detail. It consists of:
* Representation <math display="inline">p_{\theta}(s_t | s_{t-1}, a_{t-1}, o_{t}) </math><br />
* Transition <math display="inline">q_{\theta}(s_t | s_{t-1}, a_{t-1}) </math><br />
* Reward <math display="inline"> q_{\theta}(r_t | s_t)</math><br />
* Action <math display="inline"> q_{\phi}(a_t | s_t)</math><br />
* Value <math display="inline"> v_{\psi}(s_t)</math><br />
<br />
where <math>o_{t}</math> is the observation at time <math>t</math> and <math display="inline"> \theta, \phi, \psi</math> are learned neural network parameters.<br />
<br />
The three main components of an agent that learns in imagination are dynamics learning, behaviour learning, and environment interaction. In the compact latent space of the world model, the behaviour is learned by predicting hypothetical trajectories. Throughout the agent's lifetime, Dreamer performs the following operations either in parallel or interleaved, as shown in Figure 3 and Algorithm 1 of the paper:<br />
<br />
* Dynamics Learning: Using past experience data, the agent learns to encode observations and actions into latent states and predicts environment rewards. One way to do this is via representation learning.<br />
* Behaviour Learning: In the latent space, the agent predicts state values and actions that maximize future rewards through back-propagation.<br />
* Environment Interaction: The agent encodes the episode to compute the current model state and predict the next action to interact with the environment.<br />
<br />
The proposed algorithm is described below.<br />
<br />
[[File:ashraf98.png|frameless|700px|Dreamer algorithm|center]]<br />
<br />
Notice that three neural networks are trained simultaneously. <br />
The neural networks with parameters <math display="inline"> \theta, \phi, \psi </math> correspond to the models of the environment, the actions, and the values, respectively. The action model tries to solve the imagined environment by predicting actions. Meanwhile, the value model estimates the expected rewards that the action model will achieve. Hence, these two models are trained cooperatively: the action model tries to maximize the estimated value, while the value model bases its estimate on the actions chosen by the action model.<br />
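To illustrate how imagined rewards and the value model can be combined into a training target, the sketch below computes a simple bootstrapped discounted return. This is a simplification: the paper uses a more refined value estimate, and the function name and the discount factor gamma are assumptions introduced for this example.<br />

```python
def bootstrapped_return(rewards, final_value, gamma=0.99):
    """Discounted sum of imagined rewards plus the discounted value beyond the horizon.

    rewards: imagined rewards r_t, ..., r_{t+H-1} along one trajectory.
    final_value: value model estimate at the last imagined state.
    """
    ret = final_value
    # Work backwards: G_k = r_k + gamma * G_{k+1}.
    for reward in reversed(rewards):
        ret = reward + gamma * ret
    return ret
```

The value model would be regressed towards targets of this kind, while the action model is updated to increase them. The bootstrap term is what allows behaviour to account for rewards beyond the finite imagination horizon.<br />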
<br />
=== The Markovianity Question ===<br />
<br />
The paper formulates visual control as a Partially Observable Markov Decision Process (POMDP) in discrete time. Since the goal is for an agent to maximize its sum of rewards in a Markovian setting, this puts the model squarely in the category of reinforcement learning. In this subsection we provide a lengthier discussion of this Markovian assumption.<br />
<br />
Note that the transition distributions in the representation and transition models are Markovian in the states <math>s_t</math> and actions <math>a_t</math>. This mimics the dynamics of a non-linear Kalman filter and of hidden Markov models. These techniques are described in the papers by Rabiner and Juang [5] and by Kalman [6]. The difference from those presentations is that the latent dynamics are conditioned on actions and attempt to predict rewards, which allows the agent to imagine, yet not execute, actions in the provided environment.<br />
<br />
This short memory assumption is useful from a computational perspective as it allows for the problem to be tractable. It is also realistic, as an intelligent agent does not need the entire history of their environment going back all the way to the Big Bang to understand a situation they have not encountered before. We commend the team at UofT and Google Brain for this insight, as it makes their analysis reasonable and easy to understand.<br />
<br />
<br />
== Related Works ==<br />
<br />
Previous works that exploit latent dynamics can be grouped into three categories:<br />
<br />
* Visual Control with latent dynamics by derivative-free policy learning or online planning.<br />
* Augment model-free agents with multi-step predictions.<br />
* Use analytic gradients of Q-values.<br />
<br />
While the latter approaches are often applied to low-dimensional tasks, Dreamer uses analytic gradients to efficiently learn long-horizon behaviours for visual control purely by latent imagination.<br />
<br />
== Results ==<br />
The experiments were performed on 20 different control tasks of the DeepMind Control Suite [7]. The following picture shows the reward versus the number of environment steps for a few of the experiments. Dreamer outperforms the baseline algorithms, and it also converges considerably faster. <br />
[[File:dreamer.paper19.png|center|frameless|500px|Rewards vs environment steps of Dreamer and other baseline algorithms]]<br />
<br />
<br />
The figure below summarizes Dreamer's performance compared to other state-of-the-art reinforcement learning agents for continuous control tasks. Using the same hyperparameters for all tasks, Dreamer exceeds previous model-based and model-free agents in terms of data efficiency, computation time, and final performance. Overall, it achieves the most consistent performance among competing algorithms. Additionally, while other agents rely heavily on prior experience, Dreamer is able to learn behaviours with minimal interaction with the environment.<br />
<br />
[[File:scores.png|frameless|center|500px|Comparison of RL-agents against several continuous control tasks]]<br />
<br />
The performance of Dreamer is also evaluated against state-of-the-art reinforcement learning agents, as shown below.
[[File:CaptureDream.PNG|frameless|center|500px]]<br />
<br />
== Conclusion ==<br />
<br />
This paper presented a new algorithm for training reinforcement learning agents with minimal interaction with the environment. The algorithm outperforms many previous algorithms in terms of computation time and overall performance. This has many practical applications, as many agents rely on prior experience that may be hard to obtain in the real world. For example, consider a reinforcement learning agent that learns how to perform rare surgeries without enough data samples. This paper shows that it is possible to train agents without requiring many prior interactions with the environment. <br />
<br />
As future work on representation learning, the ability to scale latent imagination to higher visual complexity environments can be investigated.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at https://github.com/google-research/dreamer. <br />
<br />
== Critique ==<br />
This paper presents an approach that involves learning a latent dynamics model to learn 20 visual control tasks.<br />
<br />
Appendix A of the paper mentions that "three dense layers of size 300 with ELU activations" and "30-dimensional diagonal Gaussians" are used for the distributions in latent space. The paper would have benefitted from explaining how the authors arrived at this architecture; in other words, how the choice of latent vector affects the performance of the agent.<br />
<br />
Another fact about Dreamer is that it learns long-horizon behaviours purely by latent imagination, unlike previous approaches. It is also applicable to tasks with discrete actions and early episode termination.<br />
<br />
Learning a policy from visual inputs is a quite interesting research approach in RL. This paper steps in this direction by improving existing model-based methods (the world models and PlaNet) using the actor-critic approach. However, their method was an incremental contribution as back-propagating gradients through values and dynamics has been studied in previous works.<br />
<br />
== References ==<br />
<br />
[1] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviours by latent imagination. In International Conference on Learning Representations (ICLR), 2020.<br />
<br />
[2] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.<br />
<br />
[3] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.<br />
<br />
[4] Nian, R., Liu, J., & Huang, B. (2020). A review On reinforcement learning: Introduction and applications in industrial process control. Computers and Chemical Engineering, 139, 106886.<br />
<br />
[5] Rabiner, Lawrence, and B. Juang. "An introduction to hidden Markov models." IEEE ASSP magazine 3.1 (1986): 4-16.<br />
<br />
[6] Kalman, Rudolph Emil. "A new approach to linear filtering and prediction problems." (1960): 35-45.<br />
<br />
[7] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration&diff=49856The Curious Case of Degeneration2020-12-10T09:41:25Z<p>A2chanan: /* Conclusion */</p>
<hr />
<div>== Presented by == <br />
Donya Hamzeian<br />
== Introduction == <br />
Text generation is the task of automatically producing natural language text, as in summarization, neural machine translation, and fake news generation. Degeneration happens when the output text is incoherent or repetitive. This paper exposes the differences between human text and machine text and highlights how the choice of decoding strategy affects machine text. For example, in the figure below, the GPT-2 model tries to generate a continuation of the given context. On the left, beam search was used as the decoding strategy, and it has clearly become stuck in a repetitive loop. On the right, pure sampling has generated incoherent text. <br />
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 1: Text generation examples</div><br />
<br />
As a quick recap, beam search is a best-first search algorithm. At each step, it keeps the K most probable partial sequences, where K is the beam width, a parameter set by the user. If K is 1, beam search becomes greedy search, where only the best prediction is picked. Because the system only explores K paths, the memory requirements are reduced. <br />
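The recap above can be sketched in a few lines of Python. This is only an illustrative sketch: the callback <code>next_token_probs</code> and the toy bigram table are assumptions made up for this example and are not taken from the paper.<br />

```python
import math


def beam_search(next_token_probs, start, beam_width, steps):
    """Keep the `beam_width` highest-scoring prefixes at every step.

    `next_token_probs(prefix)` is a hypothetical callback that returns a
    dict {token: probability} for the token following `prefix`.
    """
    beams = [(0.0, [start])]  # (log-probability, token sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token, prob in next_token_probs(seq).items():
                if prob > 0.0:
                    candidates.append((score + math.log(prob), seq + [token]))
        # Keep only the K most probable partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Toy bigram model with made-up numbers, for illustration only.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
}


def toy_next_token_probs(prefix):
    # The toy model only looks at the last token of the prefix.
    return TOY_MODEL[prefix[-1]]


greedy = beam_search(toy_next_token_probs, "<s>", beam_width=1, steps=2)
wide = beam_search(toy_next_token_probs, "<s>", beam_width=2, steps=2)
```

In this toy setting, greedy search (K = 1) commits to "the" and ends with a sequence of probability 0.3, while K = 2 also keeps "a" alive and finds "a cat" with probability 0.36.<br />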
<br />
The authors argue that decoding strategies based on maximization, such as beam search, lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough, and the text generated by maximization, beam search, or top-k sampling is too probable, which indicates a lack of diversity (variance) compared to human-generated texts.<br />
<br />
Others have questioned whether a problem with beam search is that, by expanding only the top k candidates at each step of the generation, it may miss sequences that would have resulted in a more probable overall phrase in later steps. The authors argue that this isn't an issue for generating natural language, as natural language has a lower per-token probability on average and people usually optimize against saying the obvious.<br />
<br />
The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from, i.e. tokens with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have a high variance in likelihood, leading to incoherency issues. <br />
Therefore, instead of using a fixed k, it is better to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling, which is the contribution of this paper, performs this expansion and contraction of the candidate pool.<br />
<br />
<br />
===The problem with a fixed k===<br />
<br />
In the figure below, it can be seen why having a fixed k in the top-k sampling decoding strategy can lead to degenerate results, more specifically, incoherent and low-diversity texts. For instance, in the left figure, the distribution of the next token is flat, i.e., there are many tokens with nearly equal probability of being the next token. In this case, if we choose a small k, like 5, some tokens like "meant" and "want" may not appear in the generated text, which makes it less diverse. On the other hand, in the right figure, the distribution of the next token is peaked, i.e., there are very few words with very high probability. In this case, if we choose k to be large, like 10, we may end up choosing tokens like "going" and "n't", which makes the generated text incoherent. Therefore, it seems that having a fixed k may lead to degeneration.<br />
<br />
<br />
[[File: fixed-k.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 2: Flat versus peaked distribution of tokens</div><br />
<br />
==Language Model Decoding==<br />
There are two types of generation tasks. <br />
<br />
1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Due to this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.<br />
<br />
2. Open-ended generation tasks: These tasks, like conditional story generation or the tasks in the above figure, have high degrees of freedom. As a result, degeneration is more frequent in these tasks, and they are the focus of this paper.<br />
<br />
The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.<br />
<br />
\begin{align}<br />
P(x_{1:m+n})=\prod_{i=1}^{m+n}P(x_i|x_1 \ldots x_{i-1})<br />
\end{align}<br />
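As a small worked example of the factorization above, the following sketch accumulates the log-probability of a sequence token by token. The function <code>toy_cond_prob</code> is a hypothetical stand-in for a language model (uniform over a four-word vocabulary), not something from the paper.<br />

```python
import math


def sequence_log_prob(tokens, cond_prob):
    """log P(x_1..x_T) as the sum of log P(x_i | x_1..x_{i-1})."""
    total = 0.0
    for i, token in enumerate(tokens):
        total += math.log(cond_prob(token, tokens[:i]))
    return total


def toy_cond_prob(token, history):
    # Hypothetical model: every token has probability 1/4, whatever the history.
    return 0.25


log_p = sequence_log_prob(["a", "b", "c"], toy_cond_prob)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.<br />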
<br />
<br />
<br />
====Nucleus Sampling====<br />
The authors propose Nucleus Sampling as a stochastic decoding method where the shape of the probability distribution determines the set of vocabulary tokens to be sampled.<br />
They first find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. They then normalise the subset <math>V^{(p)}</math> into a probability distribution by dividing its elements by <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1})</math>, which by construction satisfies <math>p' \ge p</math>. These normalized probabilities are then used for the generation of word samples. This entire process can be viewed as a re-scaling of the original probability distribution into a new distribution <math>P'</math>, where: <br />
<br />
\begin{align}<br />
P'(x|x_{1:i-1}) = \begin{cases}\frac{P(x|x_{1:i-1})}{p'}, & x \in V^{(p)} \\ 0, & otherwise \end{cases}<br />
\end{align}<br />
<br />
This decoding strategy is beneficial as it can truncate possible long tails of the original probability distribution. Thus it can help avoid the associated problem of incoherent samples for phrases generated by long-tailed distributions, as previously discussed.<br />
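A minimal sketch of this procedure is given below, assuming the next-token distribution is available as a plain dictionary. The function names and the two toy distributions are assumptions for illustration and do not come from the official repository.<br />

```python
import random


def nucleus_filter(probs, p):
    """Return the renormalised top-p distribution as {token: probability}."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        mass += prob
        if mass >= p:  # smallest set whose cumulative mass reaches p
            break
    # `mass` plays the role of p' in the text.
    return {token: prob / mass for token, prob in nucleus}


def nucleus_sample(probs, p, rng=random):
    """Draw one token from the truncated and renormalised distribution."""
    filtered = nucleus_filter(probs, p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up distributions: one peaked, one flat.
peaked = {"hot": 0.80, "warm": 0.15, "cold": 0.03, "n't": 0.02}
flat = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

peaked_nucleus = nucleus_filter(peaked, 0.9)
flat_nucleus = nucleus_filter(flat, 0.9)
```

With p = 0.9, the peaked distribution keeps only two tokens while the flat one keeps all four, which is the expansion and contraction of the candidate pool described earlier.<br />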
<br />
====Top-k Sampling====<br />
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size <math>k</math>, <math>V^{(k)} </math>, which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling. <br />
<br />
Intuitively, the difference between Top-k sampling and Nucleus sampling is how they set the truncation threshold: the former puts a cap on the number of tokens in the vocabulary set, whereas the latter defines a cumulative-probability threshold at which the tail of the probability distribution gets truncated. It is noteworthy that thresholding the number of tokens can cause <math>p'</math> to fluctuate greatly at different time steps.<br />
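The fluctuation of <math>p'</math> under a fixed k can be seen in a short sketch. The helper <code>top_k_filter</code> and both toy distributions are assumptions for illustration only.<br />

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens; return (renormalised dict, p')."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    mass = sum(prob for _, prob in ranked)  # this is p' in the text
    return {token: prob / mass for token, prob in ranked}, mass


# Made-up distributions: ten equally likely tokens versus a peaked one.
flat10 = {"w%d" % i: 0.1 for i in range(10)}
peaked = {"hot": 0.80, "warm": 0.15, "cold": 0.03, "n't": 0.02}

flat_top2, flat_mass = top_k_filter(flat10, 2)
peaked_top2, peaked_mass = top_k_filter(peaked, 2)
```

For the same k = 2, the retained probability mass is about 0.2 for the flat distribution and about 0.95 for the peaked one.<br />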
<br />
====Sampling with Temperature====<br />
In this method, which was proposed in [1], the probability of each token is calculated according to the equation below, where <math>t \in (0,1)</math> is the temperature and <math>u_{1:|V|} </math> are the logits. <br />
<br />
<math><br />
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u_{l'}}{t})}<br />
</math><br />
<br />
Recent studies have shown that lowering <math>t</math> improves the quality of the generated texts while decreasing diversity. The temperature <math>t</math> controls how conservative the model is; the analogy comes from thermodynamics, where a lower temperature means that higher-energy states are less likely to be encountered. Hence, the lower the temperature, the less likely the model is to sample tokens with lower probability.<br />
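The effect of the temperature can be checked numerically with a small sketch. The logits below are made up, and t = 1.0 is included only as the untempered baseline for comparison.<br />

```python
import math


def softmax_with_temperature(logits, t):
    """Softmax over logits divided by the temperature t (t > 0)."""
    scaled = [u / t for u in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up logits for three tokens
p_base = softmax_with_temperature(logits, 1.0)  # plain softmax
p_cold = softmax_with_temperature(logits, 0.5)  # lower temperature
```

Lowering t moves probability mass from the least likely token to the most likely one, so the distribution becomes more peaked.<br />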
<br />
==Likelihood Evaluation==<br />
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.<br />
<br />
<br />
====Perplexity====<br />
<br />
This score was used to compare the coherence of the text produced by different decoding strategies. The graphs below show that Sampling, Top-k sampling, and Nucleus sampling can each be tuned to achieve a perplexity close to that of human-generated texts; however, with the parameters that are best according to perplexity, the first two strategies generate low-diversity texts. <br />
<br />
[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 3: Comparison of perplexity across decoding strategies</div><br />
<br />
====What is Perplexity?====<br />
<br />
Perplexity, as previously mentioned, is a score that comes from information theory [3]. It is a measure of how well a probabilistic model or distribution predicts a sample, which makes it useful for comparing how competing models explain the same sample or dataset. Perplexity has close ties to information entropy, as can be seen in the following discrete formulation of perplexity for a probability distribution.<br />
<br />
:<math>PP(p) := 2^{H(p)}=2^{-\sum_x p(x)\log_2 p(x)}</math><br />
<br />
Here <math>H(p)</math> is the entropy in bits and <math>p(x)</math> is the probability of observing <math>x</math> from the distribution.<br />
<br />
Perplexity in the context of probability models also has close ties to information entropy. The idea here is that a model <math>q(x)</math> is fit to data from an unknown probability distribution <math>p(x)</math>. When the model is given test samples that were not used during its construction, it assigns each sample some probability <math>q(x_i)</math>, where <math>x_i</math> comes from a test set with <math>i = 1,...,N</math>. The perplexity will be lowest for a model that assigns high probabilities to the test samples. This can be seen in the following equation:<br />
<br />
:<math>PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)}</math><br />
<br />
Here <math>b</math> is the base and can be any number though commonly 2 is used to represent bits.<br />
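Both formulations can be evaluated directly. The sketch below assumes that the distribution and the probabilities the model assigns to the test samples are given as plain lists; the example numbers are made up.<br />

```python
import math


def distribution_perplexity(p):
    """Perplexity 2^H(p) of a discrete distribution given as a list."""
    entropy_bits = -sum(px * math.log2(px) for px in p if px > 0.0)
    return 2.0 ** entropy_bits


def model_perplexity(test_probs, base=2.0):
    """Perplexity of a model from the probabilities it assigns to N test samples."""
    n = len(test_probs)
    avg_log = sum(math.log(q, base) for q in test_probs) / n
    return base ** (-avg_log)


uniform8 = [1.0 / 8] * 8
pp_uniform = distribution_perplexity(uniform8)
pp_model = model_perplexity([0.25, 0.25, 0.25, 0.25])
```

A uniform distribution over 8 outcomes has perplexity 8, and a model that assigns probability 1/4 to every test sample has perplexity 4, so perplexity can be read as an effective number of equally likely choices.<br />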
<br />
==Distributional Statistical Evaluation==<br />
====Zipf Distribution Analysis====<br />
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests a power-law relationship between the rank of each word and its frequency in the text. The graph below shows that the rank-frequency distribution of the texts generated with Nucleus sampling is very close to that of the human-generated (gold) texts, while the distribution for beam search is very different.<br />
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 4: Zipf Distribution Analysis</div><br />
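A rank-frequency table of the kind plotted above can be built with the standard library. This is a sketch on a made-up token list, not the evaluation code used by the authors.<br />

```python
from collections import Counter


def rank_frequency(tokens):
    """Return [(rank, word, frequency)] sorted from most to least frequent."""
    counts = Counter(tokens)
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


# Made-up corpus whose top three words follow frequency = 6 / rank.
tokens = ["the"] * 6 + ["of"] * 3 + ["cat"] * 2 + ["sat"]
table = rank_frequency(tokens)
rank_times_freq = [rank * freq for rank, _, freq in table]
```

If frequency is inversely proportional to rank, the product of rank and frequency stays roughly constant, which gives a quick check on generated text.<br />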
<br />
====Self BLEU====<br />
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the Self-BLEU scores of three decoding strategies (Top-k sampling, Sampling with Temperature, and Nucleus sampling) are compared against the Self-BLEU of human-generated texts. For Top-k sampling and Temperature Sampling, the high parameter values needed to bring Self-BLEU close to that of the human texts result in incoherent, high-perplexity text, while this is not the case for Nucleus sampling. <br />
<br />
[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 5: Comparison of Self-BLEU for decoding strategies</div><br />
<br />
==Conclusion==<br />
In this paper, different decoding strategies were analyzed on open-ended generation tasks. Their results show that (1) likelihood maximization decoding causes degeneration, (2) the probability distributions of the best current language models have an unreliable tail which needs to be truncated during generation, and (3) Nucleus Sampling is currently the best available decoding strategy for generating long-form text that is both high-quality — as measured by human evaluation — and as diverse as human-written text.<br />
However, for many practical real-life problems, we need to consider a large value of k in Top-k sampling. Thus, we can say that nucleus sampling offers a subjective advantage over the Top-k sampling method. Also, we can observe a similar issue in nucleus sampling for choosing the value of p as in Top-k sampling.<br />
<br />
== Critiques==<br />
The only problem that I can observe with nucleus sampling is that, during generation, it is restricted to words from the top-p portion of the distribution. The probabilities of the words in the smallest set whose cumulative sum reaches p are rescaled, and the sequence is sampled from this subset. I think a probing methodology would make it generate more varied and semantically richer responses.<br />
<br />
The comparison of perplexity scores across varying decoding strategies is interesting in the sense that we see that the Top-k method with k = 10^3 results in a score roughly similar to the Nucleus method with p = 0.95. Which method would be more computationally efficient? It seems that finding the set <math> V^{(p)}</math> is more computationally intensive compared to choosing the top-k outputs each time.<br />
<br />
== Repository ==<br />
<br />
The official repository for this paper is available at <span class="plainlinks">[https://github.com/ari-holtzman/degen "official repository"]</span><br />
<br />
== Tutorial on beam search ==<br />
1- [https://youtu.be/Er2ucMxjdHE Greedy Search]<br />
<br />
2- [https://youtu.be/RLWuzLLSIgw?list=PLCSzVeDv57Z1y0uWZXYX2kq5UUpqA0Mk2 Beam Search]<br />
<br />
== References ==<br />
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.<br />
<br />
[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018<br />
<br />
[3]: Perplexity: https://en.wikipedia.org/wiki/Perplexity</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=49855When Does Self-Supervision Improve Few-Shot Learning?2020-12-10T03:58:50Z<p>A2chanan: /* Conclusion */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper proposes a technique utilizing self-supervised learning (SSL) to improve the generalization of few-shot learned representations on small labeled data sets. <br />
<br />
Few-shot learning refers to training a classifier on very small datasets, contrary to the normal practice of using massive data, in the hope of successfully classifying previously unseen, but related classes. This paper also addresses the case where the labeled data is corrupted, and it considers unlabeled images from domains that are not present in the training dataset.<br />
<br />
Self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. The following image shows rotation prediction as a proxy task in self-supervision. The proposed method can help against generalization issues where the agent cannot distinguish between newly introduced objects. Self-supervision is a powerful way to take advantage of the vast amount of unlabeled data.<br />
<br />
[[File:rotation prediction 22.png|500px|center]]<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where we aim to learn general representations so that, when facing novel classes, the agent can differentiate between them after training on just a few samples. Many few-shot learning methods currently exist; this paper focuses on Prototypical Networks, or ProtoNets[1] for short. There is also a section of this paper that compares this model with the model-agnostic meta-learner (MAML)[2]. [note 1]<br />
<br />
<br />
The other machine learning technique that this paper is based on is self-supervised learning. In this technique, unlabelled data is utilized, which avoids the cost of labeling and maintaining a massive data set. Images already contain structural information that can be utilized. Many SSL tasks exist, such as removing a part of the data for the agent to reconstruct. Other tasks include predicting rotations, relative patch location, etc.<br />
<br />
The work in this paper is also related to multi-task learning. In multi-task learning, training proceeds on multiple tasks concurrently so that they improve each other. However, training on multiple tasks is known to degrade the performance on individual tasks[3], and multi-task learning seems to work only for very specific combinations of tasks and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning is mutually beneficial. This has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning.<br />
<br />
In this framework, a feed-forward convolutional network <math>f(x)</math> maps either a labeled image or an augmented unlabelled image to an embedding space. Depending on the input type, the embedding is then mapped to one of two label spaces by either a classifier <math>g</math> or a function <math>h</math>. When evaluating the accuracy of the model, only the mappings of labelled images by the classifier <math>g</math> are considered, whereas when training the model, both the mappings of labelled and unlabelled images by <math>g</math> and <math>h</math>, respectively, are utilized. <br />
The labelled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the unlabelled images used for the self-supervised tasks is denoted by <math>\mathcal{D}_{ss}</math>. Within this domain, augmentations will have been applied to the images. The authors consider two augmentation types, jigsaw puzzle and rotation. They also compare the effects on accuracy of having the unlabelled image be an augmentation of the inputted labelled image (i.e. <math>\mathcal{D}_s = \mathcal{D}_{ss}</math>) versus having the unlabelled image be an augmentation of a different image (i.e. <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>). <br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Combining supervised and self-supervised losses for few-shot learning. This paper investigates how the performance on the supervised learning task is influenced by the choice of the self-supervision task.</div><br />
<br />
The training procedure consists of mapping a labelled image and an unlabelled augmented image to separate embeddings using the shared feature backbone of the feed-forward convolutional network <math>f</math>. The model is then trained using a loss function <math>\mathcal{L}</math> which combines a classification loss term <math>\mathcal{L}_s</math> involving the labelled image embedding and a self-supervised loss term <math>\mathcal{L}_{ss}</math> involving the unlabelled augmented image embedding.<br />
<br />
The classification loss <math>\mathcal{L}_s</math> is defined as:<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
Where it is common to use cross-entropy loss for the loss function, <math> \ell </math>, and <math> \ell_2 </math> norm for the regularization, <math> \mathcal{R} </math>.<br />
<br />
The task prediction loss <math>\mathcal{L}_{ss}</math> utilizes a separate function <math>h</math> which maps the embeddings of unlabelled images to a separate label space. Here a target label <math>\hat{y}</math> is determined by the augmentation that was applied to the unlabelled image. In the case of jigsaw, the label is the index of the permutation applied to the original image. In the case of rotation, the label is the angle of rotation applied to the original image. If we define a set of labelled pairs for the previously unlabelled images as <math> \forall x \in \mathcal{D}_{ss}, x \rightarrow (\hat{x}, \hat{y}) </math>, where <math>\hat{x}</math> is the augmented version of <math>x</math>, then the task prediction loss can be defined as:<br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
<br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that for the case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.<br />
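The combination of the two losses can be sketched as follows. The sketch assumes cross-entropy for <math>\ell</math>, takes the outputs of <math>g \circ f</math> and <math>h \circ f</math> as already computed logits, and omits the regularizer <math>\mathcal{R}</math>; all names and numbers are illustrative and are not taken from the authors' code.<br />

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer class index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(u - m) for u in logits))
    return log_z - logits[target]


def combined_loss(sup_batch, ssl_batch):
    """Return (L, L_s, L_ss) with L = L_s + L_ss.

    sup_batch: list of (logits from g(f(x)), class label y)
    ssl_batch: list of (logits from h(f(x_hat)), task label y_hat)
    The regulariser R(f, g) is left out of this sketch.
    """
    loss_s = sum(cross_entropy(logits, y) for logits, y in sup_batch)
    loss_ss = sum(cross_entropy(logits, y) for logits, y in ssl_batch)
    return loss_s + loss_ss, loss_s, loss_ss


# Made-up logits: a 2-class supervised example and a 4-way rotation example.
sup_batch = [([0.0, 0.0], 0)]
ssl_batch = [([0.0, 0.0, 0.0, 0.0], 2)]
total, loss_s, loss_ss = combined_loss(sup_batch, ssl_batch)
```

Because both terms depend on the shared backbone <math>f</math>, the gradient of the sum updates <math>f</math> with information from both the labelled and the unlabelled images.<br />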
<br />
== Experiments ==<br />
To assess the proposed method, several datasets, e.g., Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet, have been employed. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class as shown in Figure 2. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
[[File:1.png |center|]]<br />
<br />
<div align="center">Figure 2: Used datasets and their base, validation and test splits.</div><br />
<br />
The authors used a meta-learning method based on prototypical networks where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based learners and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle[4]. In the rotation task, the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is the index of the applied permutation within a reduced set of 35 permutations, which is selected from all possible permutations based on the Hamming distance.<br />
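The construction of the self-supervised pairs <math>(\hat{x}, \hat{y})</math> for both tasks can be sketched on toy data. Images are represented as nested lists, and <code>toy_perms</code> is a hypothetical two-element stand-in for the fixed permutation set; none of this is taken from the authors' implementation.<br />

```python
import random


def rotate90(image):
    """Rotate a 2D list-of-rows image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_rotation_example(image, rng=random):
    """Return (rotated image, label); label indexes [0, 90, 180, 270] degrees."""
    label = rng.randrange(4)
    rotated = image
    for _ in range(label):
        rotated = rotate90(rotated)
    return rotated, label


def make_jigsaw_example(tiles, permutations, rng=random):
    """Shuffle the tiles with one allowed permutation; the label is its index."""
    label = rng.randrange(len(permutations))
    return [tiles[i] for i in permutations[label]], label


example_image = [[1, 2], [3, 4]]
rotated, rot_label = make_rotation_example(example_image, random.Random(0))

tiles = list("abcdefghi")  # stands in for the nine tiles of a 3x3 grid
toy_perms = [tuple(range(9)), tuple(reversed(range(9)))]  # hypothetical set
shuffled, jig_label = make_jigsaw_example(tiles, toy_perms, random.Random(0))
```

In both cases the label is produced by the augmentation itself, so no human annotation is needed.<br />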
<br />
== Results ==<br />
An N-way k-shot classification task contains N unique classes with k labeled images per class. The results on 5-way 5-shot classification accuracy can be seen in Fig. 3. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.<br />
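For readers unfamiliar with the N-way k-shot setup, an episode can be sampled as in the sketch below. The number of query images per class (<code>n_query</code>) and the toy dataset are assumptions made for this illustration.<br />

```python
import random


def sample_episode(data_by_class, n_way, k_shot, n_query, rng=random):
    """Sample one N-way k-shot episode as (support, query) lists of (x, label)."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        items = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]  # k labelled examples
        query += [(x, label) for x in items[k_shot:]]  # held out for evaluation
    return support, query


# Toy dataset: 7 classes with 20 integer "images" each.
toy_data = {name: list(range(20)) for name in "abcdefg"}
support, query = sample_episode(
    toy_data, n_way=5, k_shot=5, n_query=3, rng=random.Random(0)
)
```

The support set holds the k labelled images per class, and accuracy is measured on the query set.<br />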
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another experiment, it is also shown that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As can be observed from Fig. 4, SSL is found to be more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 5 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: Performance on few-shot learning using different meta-learners.</div><br />
<br />
In Fig. 6 you can see the effects of the size and domain of the SSL dataset on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 6(a) shows the changes in the accuracy as the percentage of the images from the whole dataset used for SSL is increased. It is observed that increasing the size of the SSL dataset has a positive effect, with diminishing returns. Fig. 6(b) shows the effects of shifting the domain of the SSL dataset by replacing a percentage of the images with pictures from other datasets. This has a negative effect; moreover, training with SSL on the 20 percent of the images used for meta-learning is often better than increasing the size but shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
<br />
Figure 7 shows the accuracy of the meta-learner with SSL on different domains as a function of the distance between the supervised domain Ds and the self-supervised domain Dss. Once again we see that the effectiveness of SSL decreases with the distance from the supervised domain across all datasets.<br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 7: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that Self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels.<br />
== Source Codes ==<br />
<br />
The source code can be found here: https://github.com/cvl-umass/fsl_ssl .<br />
== Conclusion ==<br />
The authors of this paper provide us with great insight into the effects of using SSL as a regularizer for few-shot learning methods. It is shown that SSL is beneficial in almost every case; however, these improvements are much higher in more difficult tasks. It is also shown that the dataset used for SSL need not be large. Increasing the size of this dataset can possibly help, but only if the added images are from the same or a similar domain.<br />
We can observe many interesting results and a unique idea for classifying images using limited training images. However, there are some issues with the model. One is that it uses a ResNet101 model pre-trained on ImageNet, and some of the ImageNet classes used in pre-training are also included in the test dataset.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 8), we can see they have used mini-ImageNet with a picture size of <math>224\times224</math>, in contrast to other methods that have used an <math>84\times84</math> image size. This gives them a huge advantage; however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
Moreover, in fig. 8 the authors considered same-domain learning for different examples, and they indicated that adding more unlabeled data from the base classes will increase the accuracy. I would be really curious to apply their approach using cross-domain learning, where the base and novel classes come from very different domains. I believe it might add some robustness and take accuracy to a different level. Also, comparing cross-domain with same-domain learning might add value to their point when they suggested that there is not much improvement from the rotation task, especially for the flowers dataset, as flowers are mostly symmetrical. <br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 8: Comparison with prior works on mini_ImageNet.</div><br />
<br />
I believe that both the strength and the weakness of this paper lie in its experiments. Different experiments compare a variety of self-supervised learning algorithms, which is a good point. However, as the reviewers also pointed out, there are some concerns, including the level of novelty in the work, the way the unlabeled pool is created, and the use of a ResNet-101 pre-trained on ImageNet for the mini-ImageNet experiments.<br />
<br />
The authors use a multi-task learning approach with self-supervision, but this approach is already used in various settings, e.g., domain adaptation, semi-supervised learning, and training GANs. So, in my opinion, their approach is incremental relative to previous works. Moreover, they showed some quite interesting and even surprising results that may need more consideration, such as Figure 7 in the summary. Some of their claims may not match the results.<br />
<br />
== Notes ==<br />
:1. Model-Agnostic Meta-Learning (MAML): Neural networks perform very well at many tasks, but they often require large datasets. In contrast, humans are able to learn new skills from only a few examples. MAML is trained on a variety of tasks, which play the role of training sets, and is then used to learn new tasks, which play the role of test sets. As a result, MAML is able to perform well on tasks with small training sets without overfitting to the data.[5]<br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)<br />
<br />
[5]: Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=49854When Does Self-Supervision Improve Few-Shot Learning?2020-12-10T03:52:53Z<p>A2chanan: /* Introduction */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper proposes a technique utilizing self-supervised learning (SSL) to improve the generalization of few-shot learned representations on small labeled data sets. <br />
<br />
Few-shot learning refers to training a classifier on very small datasets, contrary to the normal practice of using massive amounts of data, in the hope of successfully classifying previously unseen but related classes. This paper also addresses the case where the labeled data is corrupted, and it considers unlabeled images that come from a domain not present in the training dataset.<br />
<br />
Self-supervised learning aims at teaching the agent the internal structure of images by providing it with tasks such as predicting the degree of rotation of an image. The following image illustrates rotation prediction as a proxy task in self-supervision. The proposed method can help with generalization issues where the agent cannot distinguish between newly introduced objects. Self-supervision is a natural and powerful method for taking advantage of the vast amount of unlabeled data.<br />
<br />
[[File:rotation prediction 22.png|500px|center]]<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where we aim to learn general representations so that, when facing novel classes, the agent can differentiate between them after training on just a few samples. Many few-shot learning methods currently exist; this paper focuses on Prototypical Networks, or ProtoNets [1] for short. A section of the paper also compares this model with the model-agnostic meta-learner (MAML) [2]. [note 1]<br />
<br />
<br />
The other machine learning technique that this paper is based on is self-supervised learning. This technique uses unlabelled data, which avoids the expense of labeling and maintaining a massive dataset. Images already contain structural information that can be utilized. Many SSL tasks exist, such as removing a part of the data and asking the agent to reconstruct it. Other tasks include predicting rotations, predicting relative patch locations, etc.<br />
<br />
The work in this paper is also related to multi-task learning, in which training proceeds on multiple tasks concurrently so that they improve each other. Training on multiple tasks is known to degrade the performance on individual tasks [3], and it seems to help only for very specific combinations of tasks and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning is mutually beneficial. This has significant practical implications, since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning.<br />
<br />
In this framework, a feed-forward convolutional network <math>f(x)</math> maps either a labeled image or an augmented unlabelled image to an embedding space. Depending on the input type, the embedding is then mapped to one of two label spaces by either a classifier <math>g</math> or a function <math>h</math>. When evaluating the accuracy of the model, only the mappings of labelled images by the classifier <math>g</math> are considered, whereas when training the model, the mappings of both labelled and unlabelled images, by <math>g</math> and <math>h</math> respectively, are used. <br />
The labelled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the unlabelled images used for the self-supervised tasks is denoted by <math>\mathcal{D}_{ss}</math>. Within this domain, augmentations are applied to the images. The authors consider two augmentation types: jigsaw puzzle and rotation. They also compare the effects on accuracy of having the unlabelled image be an augmentation of the input labelled image (i.e. <math>\mathcal{D}_s = \mathcal{D}_{ss}</math>) versus having the unlabelled image be an augmentation of a different image (i.e. <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>). <br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Combining supervised and self-supervised losses for few-shot learning. This paper investigates how the performance on the supervised learning task is influenced by the choice of the self-supervision task.</div><br />
<br />
The training procedure consists of mapping a labelled image and an unlabelled augmented image to separate embeddings using the shared feature backbone of the feed-forward convolutional network <math>f</math>. The network is then trained using a loss function <math>\mathcal{L}</math> which combines a classification loss term <math>\mathcal{L}_s</math> involving the labelled image embedding and a self-supervised loss term <math>\mathcal{L}_{ss}</math> involving the unlabelled augmented image embedding.<br />
<br />
The classification loss <math>\mathcal{L}_s</math> is defined as:<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
where it is common to use the cross-entropy loss for the loss function <math> \ell </math> and the <math> \ell_2 </math> norm for the regularization <math> \mathcal{R} </math>.<br />
<br />
The task prediction loss <math>\mathcal{L}_{ss}</math> utilizes a separate function <math>h</math> which maps the embeddings of unlabelled images to a separate label space. Here the target label <math>\hat{y}</math> is determined by the augmentation that was applied to the unlabelled image. In the case of jigsaw, the label is the index of the permutation applied to the original image. In the case of rotation, the label is the angle of rotation applied to the original image. If we define a set of labelled pairs for the previously unlabelled images as <math> \forall x \in \mathcal{D}_{ss}, x \rightarrow (\hat{x}, \hat{y}) </math>, where <math>\hat{x}</math> is the augmented version of <math>x</math>, then the task prediction loss can be defined as:<br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
<br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised loss acts as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that for the case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch from each dataset, and the two losses are combined.<br />
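<br />
The combined objective can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear maps <code>W_f</code>, <code>W_g</code>, <code>W_h</code> and the toy data are hypothetical stand-ins, the losses are averaged rather than summed, and the regularizer <math>\mathcal{R}</math> is omitted for brevity.<br />
<br />
```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def combined_loss(f, g, h, x_s, y_s, x_ss, y_ss):
    """L = L_s + L_ss with a shared backbone f and two heads g and h."""
    loss_s = cross_entropy(g(f(x_s)), y_s)      # supervised term on labelled images
    loss_ss = cross_entropy(h(f(x_ss)), y_ss)   # self-supervised term on augmented images
    return loss_s + loss_ss

# Toy stand-ins (assumptions): 8-dim inputs, 4-dim embedding,
# 5 base classes for g and 4 rotation labels for h.
rng = np.random.default_rng(0)
W_f = rng.normal(size=(8, 4))
W_g = rng.normal(size=(4, 5))
W_h = rng.normal(size=(4, 4))
f = lambda x: np.maximum(x @ W_f, 0.0)  # shared backbone
g = lambda z: z @ W_g                   # classifier head
h = lambda z: z @ W_h                   # self-supervised head

x_s = rng.normal(size=(6, 8))
y_s = rng.integers(0, 5, size=6)
x_ss = rng.normal(size=(6, 8))
y_ss = rng.integers(0, 4, size=6)
total = combined_loss(f, g, h, x_s, y_s, x_ss, y_ss)
```
<br />
Because both terms pass through the same <code>f</code>, gradients of either term would update the shared backbone, which is what lets the self-supervised term act as a regularizer.<br />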
<br />
== Experiments ==<br />
To assess the proposed method, several datasets, e.g., Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet, have been employed. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class as shown in Figure 2. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
[[File:1.png |center|]]<br />
<br />
<div align="center">Figure 2: Used datasets and their base, validation and test splits.</div><br />
<br />
The authors used a meta-learning method based on prototypical networks, where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based learners and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part: rotation and the jigsaw puzzle [4]. In the rotation task, the input is the image rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, and the target label is the index of the rotation in the list. In the jigsaw puzzle task, the image is divided into <math>3\times3</math> tiles, and these tiles are shuffled to produce the input image. The target label is the index of the applied permutation, taken from a reduced set of 35 permutations that are chosen based on Hamming distance.<br />
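<br />
The two proxy tasks can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the set of 35 permutations is drawn at random here, whereas the paper selects its set using Hamming distance.<br />
<br />
```python
import numpy as np

ANGLES = (0, 90, 180, 270)

def make_rotation_example(image, rng):
    """Rotate an H x W x C image by a multiple of 90 degrees.
    The label is the index of the angle in ANGLES."""
    label = int(rng.integers(len(ANGLES)))
    rotated = np.rot90(image, k=label, axes=(0, 1))
    return rotated, label

def make_jigsaw_example(image, permutations, rng, grid=3):
    """Cut the image into grid x grid tiles and shuffle them.
    The label is the index of the permutation that was applied."""
    th, tw = image.shape[0] // grid, image.shape[1] // grid
    tiles = [image[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
             for r in range(grid) for c in range(grid)]
    label = int(rng.integers(len(permutations)))
    order = permutations[label]
    rows = [np.concatenate([tiles[order[r * grid + c]] for c in range(grid)], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0), label

rng = np.random.default_rng(0)
image = np.arange(6 * 6 * 3).reshape(6, 6, 3)
# Assumption: a random permutation set stands in for the Hamming-selected one.
permutations = [rng.permutation(9) for _ in range(35)]
rotated, rot_label = make_rotation_example(image, rng)
shuffled, jig_label = make_jigsaw_example(image, permutations, rng)
```
<br />
In both cases the label comes for free from the transformation itself, so no human annotation is needed.<br />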
<br />
== Results ==<br />
An N-way k-shot classification task contains N unique classes with k labeled images per class. The results on 5-way 5-shot classification accuracy can be seen in Fig. 3. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.<br />
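<br />
To make the episode format concrete, the following sketch classifies the queries of a 5-way 5-shot episode with the nearest-prototype rule used by prototypical networks [1]. The embeddings are synthetic (random points around hypothetical class centers) and stand in for the output of a trained backbone.<br />
<br />
```python
import numpy as np

def protonet_predict(support_emb, support_labels, query_emb, n_way):
    """Assign each query to the class whose prototype (mean support embedding) is closest."""
    prototypes = np.stack([support_emb[support_labels == c].mean(axis=0)
                           for c in range(n_way)])
    # Squared Euclidean distance between every query and every prototype.
    dists = ((query_emb[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
n_way, k_shot, dim = 5, 5, 16
centers = 10.0 * np.eye(n_way, dim)  # hypothetical, well-separated class centers
support_labels = np.repeat(np.arange(n_way), k_shot)
support_emb = centers[support_labels] + 0.1 * rng.normal(size=(n_way * k_shot, dim))
query_labels = np.arange(n_way)
query_emb = centers[query_labels] + 0.1 * rng.normal(size=(n_way, dim))

pred = protonet_predict(support_emb, support_labels, query_emb, n_way)
accuracy = (pred == query_labels).mean()
```
<br />
The reported accuracies are averages of this kind of per-episode accuracy over many sampled episodes.<br />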
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another experiment, the authors also show that the improvements from self-supervised learning are much larger in more difficult few-shot learning problems. As can be observed from Fig. 4, SSL is more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 5 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: Performance on few-shot learning using different meta-learners.</div><br />
<br />
Fig. 6 shows the effects of the size and domain of the SSL dataset on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 6(a) shows how the accuracy changes as the percentage of images from the whole dataset used for SSL increases. Increasing the size of the SSL dataset has a positive effect, with diminishing returns. Fig. 6(b) shows the effects of shifting the domain of the SSL dataset by replacing a percentage of the images with pictures from other datasets. This has a negative effect; moreover, training with SSL on only the 20 percent of the images used for meta-learning is often better than increasing the size while shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
<br />
Figure 7 shows the accuracy of the meta-learner with SSL on different domains as a function of the distance between the supervised domain <math>\mathcal{D}_s</math> and the self-supervised domain <math>\mathcal{D}_{ss}</math>. Once again we see that the effectiveness of SSL decreases with the distance from the supervised domain across all datasets.<br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 7: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than a network trained with a simple cross-entropy loss (85.5%) on the labels.<br />
== Source Codes ==<br />
<br />
The source code can be found here: https://github.com/cvl-umass/fsl_ssl .<br />
== Conclusion ==<br />
The authors of this paper provide us with great insight into the effects of using SSL as a regularizer for few-shot learning methods. They show that SSL is beneficial in almost every case and that the improvements are much larger in more difficult tasks. They also show that the dataset used for SSL need not be large. Increasing the size of this dataset can help, but only if the added images are from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the jigsaw puzzle and rotation tasks, e.g. counting the number of objects or predicting a removed patch. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 8), we can see that they used mini-ImageNet with an image size of <math>224\times224</math>, in contrast to other methods that used an <math>84\times84</math> image size. This gives them a large advantage; nevertheless, we still notice that other methods using smaller images have achieved higher accuracy.<br />
<br />
Moreover, in Fig. 8 the authors considered same-domain learning for different examples, and they indicated that adding more unlabeled data from the base classes increases the accuracy. I would be really curious to apply their approach using cross-domain learning, where the base and novel classes come from very different domains. I believe it might add some robustness and take accuracy to a different level. Also, comparing cross-domain with same-domain learning might add value to their point when they concluded that there is not much improvement from the rotation task, especially in the flowers example, as flowers are mostly symmetrical. <br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 8: Comparison with prior works on mini-ImageNet.</div><br />
<br />
I believe that both the strength and the weakness of this paper lie in its experiments. The experiments compare a variety of self-supervised learning algorithms, which is a good point. However, as the reviewers also pointed out, there are some concerns, including the level of novelty in the work, the way the unlabeled pool is created, and the use of a pre-trained ResNet-101 on ImageNet and mini-ImageNet in their experiments.<br />
<br />
The authors use a multi-task learning approach with self-supervision, but this approach is already used in various settings, e.g., domain adaptation, semi-supervised learning, and training GANs. So, in my opinion, their approach is incremental relative to previous works. Moreover, they showed some quite interesting and even surprising results that may need more consideration, such as Figure 7 in the summary. Some of their claims may not match the results.<br />
<br />
== Notes ==<br />
:1. Model-Agnostic Meta-Learning (MAML): Neural networks perform very well at many tasks, but they often require large datasets. In contrast, humans are able to learn new skills from only a few examples. MAML is trained on a variety of tasks, which play the role of training sets, and is then used to learn new tasks, which play the role of test sets. As a result, MAML is able to perform well on tasks with small training sets without overfitting to the data.[5]<br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)<br />
<br />
[5]: Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49853This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-10T03:36:46Z<p>A2chanan: /* Results */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in an interpretable manner during classification tasks. The goal of the algorithm is to utilize human-understandable reasoning to perform image classification tasks.<br />
<br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists of dissecting parts of the input images and comparing them to prototypical parts of training images of a given class, thus the expression "this looks like that". Interpretability is crucial in many problems that require understanding how a model makes a particular prediction. Neural networks are typically seen as some of the least interpretable models in machine learning [4]. Medical imaging is one instance where interpretability is critical, where diagnosis using X-ray scans is based on comparing to other prototypical scans. [1]<br />
<br />
== Previous Work ==<br />
Interpretability in deep networks has been a long-sought goal and is a growing field of research. There already exist post-hoc interpretability methods that analyze a trained CNN, such as [2], but this type of analysis does not explain how a network actually makes its decisions during classification, since the explanations are created after the fact. There are also attention-based models that identify the parts of the input they are looking at, but without associating them with prototypical samples [3].<br />
<br />
== Network Architecture ==<br />
The figure below shows the ProtoPNet architecture. The first layers of the network are commonly used convolutional layers <math>f</math>, whose parameters are denoted <math>w_{conv}</math>. The layers used in this study come from the following well-known models, pre-trained on ImageNet: '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161'''. They are followed by two additional 1 × 1 convolutional layers. Next comes a prototype layer <math>g_p</math>, followed by a fully connected layer <math>h</math> with weight matrix <math>w_h</math> and no bias, which returns the output prediction through a softmax function, unlike the rest of the layers, which use ReLU as the activation function. The network takes in an image <math>x</math>, which is propagated through the convolutional layers <math>f</math> to extract features of shape <math>H \times W \times D</math>, and learns prototypes <math>P</math>, each of shape <math>1 \times 1 \times D</math>. The number of prototypes <math>m_k</math> is pre-defined for each class <math>k</math> (10 per class in this study). Each prototype represents a pattern in a patch of the convolutional output, corresponding to some prototypical image patch in the original pixel space. Given an output <math>z = f(x)</math>, the j-th prototype unit <math>g_{p_{j}}</math> in the prototype layer <math>g_p</math> computes the squared L2 distances between the j-th prototype <math>p_j</math> and all patches of <math>z</math> that have the same shape as <math>p_j</math>, and converts them into similarity scores. These scores indicate the presence of the prototypical part in the image while preserving the spatial layout of <math>z</math>. The resulting activation map can be up-sampled to the original image size to obtain a heatmap showing which parts of the image are most similar to the prototype. 
The scores from each unit are then reduced by max-pooling to a single score indicating how strongly the prototypical pattern is present in the input. These scores are multiplied by the weight matrix <math>w_h</math> in <math>h</math> to produce the output logits, as shown in Figure 1.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
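<br />
The computation performed by the prototype layer and the final layer can be sketched in NumPy as below. The conversion from squared distance to similarity uses the log-ratio form <math>\log\big((d + 1)/(d + \epsilon)\big)</math>, which is stated here as an assumption since the summary above does not spell it out; the shapes, the value of <code>eps</code>, and all variable names are illustrative.<br />
<br />
```python
import numpy as np

def prototype_scores(z, prototypes, eps=1e-4):
    """z: H x W x D conv output; prototypes: m x D (each one a 1 x 1 x D patch).
    Returns the (m,) max-pooled similarity scores and the (m, H, W) activation maps."""
    diff = z[None, :, :, :] - prototypes[:, None, None, :]   # m x H x W x D
    sq_dist = (diff ** 2).sum(axis=-1)                       # squared L2 distance per patch
    sim = np.log((sq_dist + 1.0) / (sq_dist + eps))          # small distance -> large similarity
    return sim.max(axis=(1, 2)), sim                         # global max-pooling per prototype

def class_logits(scores, w_h):
    """Fully connected layer without bias: weighted sum of prototype scores."""
    return scores @ w_h

rng = np.random.default_rng(0)
z = rng.normal(size=(7, 7, 8))           # toy conv output
prototypes = rng.normal(size=(6, 8))     # toy prototypes
prototypes[0] = z[2, 3]                  # make prototype 0 match one patch exactly
w_h = rng.normal(size=(6, 3))            # 6 prototypes, 3 classes

scores, maps = prototype_scores(z, prototypes)
logits = class_logits(scores, w_h)
peak = np.unravel_index(maps[0].argmax(), maps[0].shape)
```
<br />
Up-sampling <code>maps[j]</code> to the input resolution gives the heatmap for prototype j, and <code>peak</code> locates the patch that the prototype matches best.<br />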
<br />
== Training Algorithm ==<br />
The network is trained in three stages: first, stochastic gradient descent (SGD) on all layers before the last one; second, projection of the prototypes; and third, convex optimization of the last layer. In the first stage, the model identifies the most significant patches for the classification task and distinguishes between the prototypes of the images' true classes and those of other classes. SGD is used to optimize the parameters of the convolutional layers and the prototypes of the prototype layer while the weights of the fully connected layer are kept fixed, so that the network learns to decrease the predicted probability when a part of an image of a given class is similar to a prototype from a different class. In the second stage, the aim is to visualize each prototype by associating it with the most similar training image patch, using the following update for every prototype of a class k:<br />
<math> p_j \leftarrow \underset{z \in Z_j}{\operatorname{arg\,min}} \lVert z - p_j \rVert_2 \quad\textrm{where}\quad Z_j = \{z : z \in \textrm{patches}(f(x_i)) \ \forall i \ \textrm{s.t.}\ y_i = k \} </math><br />
During this stage, associating a patch of the training image x to its corresponding prototype p is done as a result of the activation. The patch of x that is selected is the one that p activates the most given the activation map of x by p.<br />
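<br />
The projection step can be sketched as a nearest-neighbour search over latent patches. This is a simplified illustration with hypothetical names; it assumes the feature maps of the training images of class k have already been computed.<br />
<br />
```python
import numpy as np

def project_prototype(p_j, class_feature_maps):
    """Replace p_j with the closest latent patch among the feature maps f(x_i)
    of training images whose label is the prototype's class.
    class_feature_maps: list of H x W x D arrays."""
    patches = np.concatenate([fm.reshape(-1, fm.shape[-1]) for fm in class_feature_maps])
    dists = np.linalg.norm(patches - p_j, axis=1)   # L2 distance to every 1 x 1 x D patch
    return patches[dists.argmin()].copy()

rng = np.random.default_rng(0)
feature_maps = [rng.normal(size=(4, 4, 8)) for _ in range(3)]  # toy f(x_i) for one class
target = feature_maps[1][2, 1]
p_j = target + 0.01 * rng.normal(size=8)    # a prototype that drifted slightly from a real patch
projected = project_prototype(p_j, feature_maps)
```
<br />
After projection every prototype is an actual patch of some training image, which is what makes it possible to show the prototype to a user.<br />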
In the last training stage, convex optimization is applied to the last layer while the parameters of the previous layers are fixed, to improve accuracy by making the last layer sparse. In other words, it discourages the model from assigning an image to a class merely because the image lacks prototypes of other classes.<br />
The optimization problem that they try to solve is:<br />
[[File:CaptureDL.PNG|600px|center]]<br />
<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two examples of the classification task process of images from both datasets and the process of decision making.<br />
<br />
'''Examples of reasoning process:''' <br />
As shown in the figure below, given a test image, the model first compares it to all learned prototypes (from all classes), looking for evidence that the image belongs to a certain class k by using the prototypes of class k. The comparison returns a similarity score for each prototype <math>p_i</math> and finds the part of the image that is most strongly activated by <math>p_i</math>. These scores are weighted and summed to classify the test image.<br />
[[File:exp1.jpg|1200px|center]]<br />
<div align="center">Figure 2 : Classifying an image of specific car model </div><br />
[[File:exp2.jpg|1200px|center]]<br />
<div align="center">Figure 3 : Predicting the species of a bird </div><br />
<br />
== Results ==<br />
The results obtained using ProtoPNet on the bird images and the car models are compared to baseline models as well as attention-based deep models that were trained on the same datasets. ProtoPNet's accuracy is very close to that of the non-interpretable baselines, as shown in the tables below. <br />
[[File:table1protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 4 : Accuracy comparison of ProtoPNet with baseline models and other deep models on bird species dataset </div><br><br />
<br />
[[File:table2protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 5 : Accuracy comparison of ProtoPNet with baseline models on car dataset </div><br><br />
<br />
Another experiment, which combines several ProtoPNet models, shows an improvement in accuracy while preserving the transparency of the decision-making process. The paper also implemented the model with an architecture similar to the ALL-CNN-V network and obtained an accuracy of 89.30% on the CIFAR-10 dataset.<br />
<br />
== Conclusion ==<br />
The aim of constructing the ProtoPNet network was to introduce the interpretability property to neural networks. It is able to dissect images to find prototypical parts. The predictions of an image are made based on a comparison of parts of this image and learned prototypes of each class. One of the greatest advantages of ProtoPNet is that it allows the user to observe the process of how the model is making predictions and therefore understands the reasoning in case of misclassification errors. However, one disadvantage of this network is the addition of another hyperparameter in the form of the number of prototypes.<br />
<br />
== Critique ==<br />
I think that this is a really interesting approach to providing insight into why a neural network made a certain prediction. Intuitively, based on the architecture, it seems that each convolutional layer learns a certain "aspect" of the image (i.e. the wheel of a car, the beak of a bird, etc.). It would be interesting to see how much further one can take this idea, especially in classifying images of things that appear very similar to the human eye (i.e. various insects).<br />
<br />
== Source code ==<br />
The code for this paper is available at [https://github.com/cfchen-duke/ProtoPNet https://github.com/cfchen-duke/ProtoPNet]<br />
<br />
== References == <br />
[1] C. Chen, O. Li, A. Barnett, J. Su, C. Rudin, This looks like that: deep<br />
learning for interpretable image recognition, arXiv preprint,<br />
arXiv:1806.10574, 2018.<br />
<br />
[1] A. Holt, I. Bichindaritz, R. Schmidt, and P. Perner. Medical applications in case-based reasoning. The<br />
Knowledge Engineering Review, 20:289–292, 09 2005.<br />
<br />
[2] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Inside Convolutional Networks: Visualising Image<br />
Classification Models and Saliency Maps. In Workshop at the 2nd International Conference on Learning<br />
Representations (ICLR Workshop), 2014<br />
<br />
[3] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative<br />
Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),<br />
pages 2921–2929. IEEE, 2016<br />
<br />
[4] Molnar, Christoph. "Interpretable machine learning. A Guide for Making Black Box Models Explainable", 2019. https://christophm.github.io/interpretable-ml-book/</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT&diff=49852BERTScore: Evaluating Text Generation with BERT2020-12-10T03:28:08Z<p>A2chanan: /* Critique & Future Prospects */</p>
<hr />
<div>== Presented by == <br />
Gursimran Singh<br />
<br />
== Introduction == <br />
In recent times, various machine learning approaches for text generation have gained popularity. This paper aims to develop an automatic metric that judges the quality of generated text. Commonly used state-of-the-art metrics use either n-gram approaches or word embeddings to calculate the similarity between the reference and the candidate sentence. BERTScore, on the other hand, calculates the similarity using contextual embeddings. BERTScore addresses two common pitfalls of n-gram-based metrics. Firstly, n-gram models fail to robustly match paraphrases, which leads to performance underestimation when semantically correct phrases are penalized because they differ from the surface form of the reference. In BERTScore, by contrast, the similarity is computed using contextualized token embeddings, which have been shown to be effective for paraphrase detection. Secondly, n-gram models fail to capture distant dependencies and to penalize semantically critical ordering changes. In contrast, contextualized embeddings capture distant dependencies and ordering effectively. Finally, BERTScore is a task-independent evaluation metric, which makes it a better choice than other state-of-the-art metrics. The authors of the paper have carried out various experiments in machine translation and image captioning to show why BERTScore is more reliable and robust than the previous approaches.<br />
<br />
''' Word versus Context Embeddings '''<br />
<br />
Both kinds of embeddings aim to reduce the sparseness of a bag-of-words (BoW) representation of text, which is caused by high-dimensional vocabularies. Both methods create embeddings of a dimensionality much lower than sparse BoW and aim to capture semantics and context. Word embeddings are deterministic: given a word, a word embedding model will always produce the same embedding, regardless of the surrounding words. Contextual embeddings, however, produce different embeddings for a word depending on the surrounding words in the given text.<br />
<br />
== Previous Work ==<br />
Previous Approaches for evaluating text generation can be broadly divided into various categories. The commonly used techniques for text evaluation are based on n-Gram matching. The main objective here is to compare the n-grams in reference and candidate sentences and thus analyze the ordering of words in the sentences. Most of the methods utilize or slightly modify the exact match precision (Exact-<math>P_n</math>) and recall (Exact-<math>R_n</math>) scores. These scores can be formalized as follows:<br />
<br />
<div align="center">Exact- <math> P_n = \frac{\sum_{w \in S^{n}_{\hat{x}}} \mathbb{I}[w \in S^{n}_{x}]}{|S^{n}_{\hat{x}}|} </math> </div><br />
<br />
<div align="center">Exact- <math> R_n = \frac{\sum_{w \in S^{n}_{x}} \mathbb{I}[w \in S^{n}_{\hat{x}}]}{|S^{n}_{x}|} </math> </div><br />
<br />
Here <math>\mathbb{I}[\cdot]</math> is an indicator function, and <math>S^{n}_{x}</math> and <math>S^{n}_{\hat{x}}</math> are the lists of token <math>n</math>-grams in the reference <math>x</math> and candidate <math>\hat{x}</math> sentences respectively.<br />
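<br />
As a minimal sketch of the two definitions above, the following Python computes Exact-<math>P_n</math> and Exact-<math>R_n</math> for a single sentence pair. Whitespace tokenization, the helper names ngrams and exact_precision_recall, and the example sentences are assumptions made for illustration and do not come from the paper.<br />

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def exact_precision_recall(reference, candidate, n):
    """Exact-P_n and Exact-R_n for two whitespace-tokenized sentences."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    ref_set, cand_set = set(ref), set(cand)
    # Precision: fraction of candidate n-grams that occur in the reference.
    precision = sum(g in ref_set for g in cand) / len(cand) if cand else 0.0
    # Recall: fraction of reference n-grams that occur in the candidate.
    recall = sum(g in cand_set for g in ref) / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical sentence pair, used only to exercise the functions.
reference = "the cat sat on the mat"
candidate = "the cat is on the mat"
print(exact_precision_recall(reference, candidate, 1))
print(exact_precision_recall(reference, candidate, 2))
```

For this pair, five of the six candidate unigrams and three of the five candidate bigrams occur in the reference, so a single substituted word already lowers the bigram scores noticeably.<br />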
<br />
The most popular n-gram matching metric is BLEU (Bilingual Evaluation Understudy). Its output lies between 0.0 and 1.0, where 0.0 denotes a complete mismatch and 1.0 a perfect match between the candidate and reference sentences. It follows the underlying principle of n-gram matching and makes three modifications to Exact-<math>P_n</math>: <br><br />
• Each n-gram in the reference is matched at most once. <br><br />
• The number of exact matches is accumulated over all reference-candidate pairs and divided by the total number of <math>n</math>-grams in all candidate sentences. <br><br />
• Very short candidates are penalized through a brevity penalty. <br><br />
<br />
Further, BLEU is generally computed for several values of <math>n</math> and the results are averaged geometrically.<br />
Other n-gram approaches include METEOR, NIST, ΔBLEU, etc.<br />
METEOR (Banerjee & Lavie, 2005) computes Exact- <math> P_1 </math> and Exact- <math> R_1 </math> with the modification that, when exact unigram matching is not possible, matching to word stems, synonyms, and paraphrases is used instead. For example, ''running'' may be matched with ''run'' if no exact match is found. This non-exact matching relies on external tools such as a paraphrase table. In newer versions of METEOR, an external paraphrase resource is used and different weights are assigned to different matching types. <br />
<br />
Other categories include edit-distance-based metrics, which compare two strings by counting the minimum number of operations needed to transform one into the other; embedding-based metrics, which are derived from an embedding space applied to the strings; and learned metrics, which construct task-specific metrics by applying machine learning to a supervised dataset. Most of these techniques do not capture the context of a word in the sentence. Moreover, learned metrics require costly human judgments as supervision for each dataset.<br />
<br />
== Motivation ==<br />
The <math>n</math>-gram approaches like BLEU do not capture the positioning and the context of the word and simply rely on exact matching for evaluation. Consider the following example that shows how BLEU can result in incorrect judgment. <br><br />
Reference: people like foreign cars <br><br />
Candidate 1: people like visiting places abroad <br><br />
Candidate 2: consumers prefer imported cars<br />
<br />
BLEU gives a higher score to Candidate 1 as compared to Candidate 2. This undermines the performance of text generation models since contextually correct sentences are penalized. In contrast, some semantically different phrases are scored higher just because they are closer to the surface form of the reference sentence. <br />
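<br />
The ranking described above can be reproduced with a small, deliberately simplified BLEU-style score. This is a sketch under stated assumptions, not the official BLEU implementation: it scores a single sentence pair, uses only unigrams and bigrams (standard BLEU is usually computed at corpus level with up to 4-grams), and the function names ngram_counts and simple_bleu are hypothetical.<br />

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=2):
    """Simplified sentence-level BLEU: clipped precisions, geometric mean, brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_counts, cand_counts = ngram_counts(ref, n), ngram_counts(cand, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference,
        # so a reference n-gram cannot be matched more often than it occurs.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # a zero precision makes the geometric mean zero
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: candidates shorter than the reference are scaled down.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "people like foreign cars"
candidate_1 = "people like visiting places abroad"
candidate_2 = "consumers prefer imported cars"
print(round(simple_bleu(reference, candidate_1), 3))  # 0.316
print(simple_bleu(reference, candidate_2))            # 0.0
```

Candidate 1 shares the bigram "people like" with the reference and so receives the higher score, while Candidate 2 shares only the unigram "cars" and scores zero even though it is the better paraphrase.<br />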
<br />
On the other hand, BERTScore computes similarity using contextual token embeddings, which helps it detect semantically correct paraphrases. It also captures cause-and-effect relationships (for example, distinguishing "A gives B" from "B gives A") that BLEU fails to detect.<br />
<br />
== BERTScore Architecture ==<br />
Fig 1 summarizes the steps for calculating the BERTScore. Next, we will see details about each step. Here, the reference sentence is given by <math> x = \langle x_1, \ldots, x_k \rangle </math> and the candidate sentence by <math> \hat{x} = \langle \hat{x}_1, \ldots, \hat{x}_l \rangle. </math> <br><br />
<br />
<div align="center"> [[File:Architecture_BERTScore.PNG|Illustration of the computation of BERTScore.]] </div><br />
<div align="center">'''Fig 1'''</div><br />
<br />
=== Token Representation ===<br />
Reference and candidate sentences are represented using contextual embeddings. The idea is inspired by word embedding techniques, but in contrast to word embeddings, the contextual embedding of a word depends on the surrounding words in the sentence. These contextual embeddings are calculated using models such as BERT, RoBERTa, XLNet, and XLM, which rely on self-attention and nonlinear transformations.<br />
<br />
<div align="center"> [[File:Pearsson_corr_contextual_emb.PNG|Pearson Correlation for Contextual Embedding]] </div><br />
<div align="center">'''Fig 2'''</div><br />
<br />
=== Cosine Similarity ===<br />
Pairwise cosine similarity is calculated between each token <math> x_{i} </math> in the reference sentence and each token <math> \hat{x}_{j} </math> in the candidate sentence. Because the vectors are pre-normalized, the pairwise similarity reduces to the inner product <math> x_{i}^T \hat{x}_{j}. </math><br />
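<br />
The following sketch illustrates this step with plain Python lists. The two-dimensional vectors are hypothetical stand-ins for contextual embeddings (real embeddings have hundreds of dimensions and come from a pre-trained model), the function names are illustrative, and the code assumes that no vector is all zeros.<br />

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def similarity_matrix(ref_vectors, cand_vectors):
    """sim[i][j] = cosine similarity of reference token i and candidate token j."""
    ref = [normalize(v) for v in ref_vectors]
    cand = [normalize(v) for v in cand_vectors]
    # After normalization the cosine similarity is just the inner product.
    return [[sum(a * b for a, b in zip(r, c)) for c in cand] for r in ref]


# Toy embeddings: two reference tokens and two candidate tokens.
ref_vectors = [[1.0, 0.0], [0.0, 2.0]]
cand_vectors = [[3.0, 0.0], [1.0, 1.0]]
for row in similarity_matrix(ref_vectors, cand_vectors):
    print([round(value, 3) for value in row])
```

Each row of the resulting matrix corresponds to one reference token and each column to one candidate token, which is the layout the matching step operates on.<br />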
<br />
=== BERTScore ===<br />
<br />
Each token in <math> x </math> is matched to the most similar token in <math> \hat{x} </math> to compute Recall, and each token in <math> \hat{x} </math> is matched to the most similar token in <math> x </math> to compute Precision. The matching is greedy and isolated, meaning that every token is matched independently of the others. Precision and Recall are then combined into the F1 score. The equations for Precision, Recall, and F1 Score are as follows:<br />
<br />
<div align="center"> [[File:Equations.PNG|Equations for the calculation of BERTScore.]] </div><br />
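<br />
The greedy matching can be sketched directly on a similarity matrix whose rows are reference tokens and whose columns are candidate tokens. The function name bertscore_from_similarity and the matrix values are assumptions chosen for illustration, and importance weighting is left out.<br />

```python
def bertscore_from_similarity(sim):
    """Greedy-matching precision, recall and F1 from sim[i][j] (reference i, candidate j)."""
    # Recall: every reference token takes its best-matching candidate token.
    recall = sum(max(row) for row in sim) / len(sim)
    # Precision: every candidate token takes its best-matching reference token.
    columns = list(zip(*sim))
    precision = sum(max(col) for col in columns) / len(columns)
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical similarities for 2 reference tokens and 3 candidate tokens.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]
precision, recall, f1 = bertscore_from_similarity(sim)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here recall averages the row maxima (0.9 and 0.8) and precision averages the column maxima (0.9, 0.8 and 0.4), so the unmatched third candidate token lowers precision but not recall.<br />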
<br />
<br />
=== Importance Weighting (optional) ===<br />
In some cases, rare words can be highly indicative of sentence similarity. Therefore, Inverse Document Frequency (idf) weights can be incorporated into the above BERTScore equations. This step is optional: depending on the domain of the text and the available data, it may or may not improve the final results. How best to apply importance weighting therefore remains an open area of research.<br />
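<br />
A rough sketch of idf-weighted recall is given below. It assumes that idf is estimated from a collection of tokenized reference sentences and that a plus-one smoothed form, log((M + 1) / (df + 1)), is used; the exact smoothing, the function names build_idf and weighted_recall, and the toy corpus are illustrative assumptions rather than the authors' implementation.<br />

```python
import math
from collections import Counter


def build_idf(reference_corpus):
    """Return an idf function estimated from a list of tokenized sentences."""
    m = len(reference_corpus)
    # Document frequency: number of sentences in which each token occurs.
    doc_freq = Counter(tok for sent in reference_corpus for tok in set(sent))

    def idf(token):
        # Plus-one smoothing keeps unseen tokens finite (assumed form).
        return math.log((m + 1) / (doc_freq[token] + 1))

    return idf


def weighted_recall(ref_tokens, sim, idf):
    """idf-weighted average of each reference token's best similarity."""
    weights = [idf(t) for t in ref_tokens]
    total = sum(weights)
    if total == 0:
        # All weights vanish: fall back to the unweighted mean.
        return sum(max(row) for row in sim) / len(sim)
    return sum(w * max(row) for w, row in zip(weights, sim)) / total


corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "bird", "flew"]]
idf = build_idf(corpus)
sim = [[0.9, 0.1], [0.2, 0.5]]  # rows correspond to "the" and "cat"
print(round(weighted_recall(["the", "cat"], sim, idf), 3))  # 0.5
```

Because "the" occurs in every sentence of the toy corpus, its weight is zero and the score is determined entirely by the rarer word "cat"; the unweighted recall for the same matrix would be 0.7.<br />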
<br />
=== Baseline Rescaling ===<br />
Rescaling is done only to increase the human readability of the score. In theory, cosine similarity values are between -1 and 1 but practically they are confined in a much smaller range. A value b computed using Common Crawl monolingual datasets is used to linearly rescale the BERTScore. The rescaled recall <math> \hat{R}_{BERT} </math> is given by<br />
<div align="center"> [[File:Equation2.PNG|Equation for the rescaled BERTScore.]] </div><br />
Similarly, <math> P_{BERT} </math> and <math> F_{BERT} </math> are rescaled as well.<br />
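<br />
As a small worked example of the linear rescaling, the sketch below assumes the form (score - b) / (1 - b), which maps the baseline <math> b </math> to 0 and a perfect score of 1 to 1. The baseline value 0.85 is hypothetical and is not one of the values computed by the authors.<br />

```python
def rescale(score, baseline):
    """Linearly rescale a BERTScore value so that the baseline maps to 0 and 1 stays 1."""
    return (score - baseline) / (1.0 - baseline)


baseline = 0.85  # hypothetical baseline value b
for raw in (0.85, 0.92, 1.0):
    print(raw, round(rescale(raw, baseline), 3))
```

With this baseline a raw score of 0.92 becomes roughly 0.467, which spreads the scores over a wider and more readable range without changing how candidates are ranked.<br />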
<br />
=== Experiment & Results ===<br />
The authors experimented with different pre-trained contextual embedding models, such as BERT and RoBERTa, and reported the results of the best-performing model. In addition to the standard evaluation, they designed model selection experiments using 10K hybrid systems super-sampled from WMT18, from which 100 systems are selected at random and ranked using the automatic metrics. The evaluation covers Machine Translation and Image Captioning tasks.<br />
<br />
=== Machine Translation ===<br />
The metric evaluation dataset consists of 149 translation systems, gold references, and two types of human judgments: segment-level and system-level. The former assigns a score to each reference-candidate pair and the latter assigns a single score to the whole system. Segment-level outputs for BERTScore are calculated as explained in the previous section on architecture, and system-level outputs are calculated by averaging BERTScore over every reference-candidate pair. Metric quality is measured with the absolute Pearson correlation <math> \lvert \rho \rvert </math> and the Kendall rank correlation <math> \tau </math>. The Williams test <sup> [1] </sup> is used to assess the significance of <math> \lvert \rho \rvert </math>, and bootstrap resampling as described by Graham & Baldwin <sup> [2] </sup> is used for <math> \tau </math>. The authors also create hybrid systems by randomly sampling, for each reference sentence, one candidate sentence from one of the systems, which increases the number of systems available for system-level experiments. Further, they randomly select 100 systems out of the 10K hybrid systems and rank them using the automatic metrics. Repeating this process multiple times yields Hits@1, the percentage of trials in which the metric and the human ranking agree on the best system. <br />
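<br />
To make the two correlation measures concrete, the sketch below implements the Pearson correlation and a plain Kendall tau (tau-a, counting concordant and discordant pairs) in pure Python. The plain tau-a form is an assumption and may differ from the variant used in the shared task, and the metric and human scores are made-up numbers, not results from the paper.<br />

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical system-level scores for four systems.
metric_scores = [0.61, 0.72, 0.55, 0.80]
human_scores = [0.20, 0.50, 0.10, 0.40]
print(round(abs(pearson(metric_scores, human_scores)), 3))
print(round(kendall_tau(metric_scores, human_scores), 3))  # 0.667
```

In this toy example five of the six system pairs are ordered the same way by the metric and by the human scores, giving a tau of (5 - 1) / 6.<br />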
<br />
<div align="center"> '''The following 4 tables show the result of the experiments mentioned above.''' </div> <br><br />
<br />
<div align="center"> [[File:Table1_BERTScore.PNG|700px| Table1 Machine Translation]] [[File:Table2_BERTScore.PNG|700px| Table2 Machine Translation]] </div><br />
<div align="center"> [[File:Table3_BERTScore.PNG|700px| Table3 Machine Translation]] [[File:Table4_BERTScore.PNG|700px| Table4 Machine Translation]] </div><br />
<br />
In all four tables, BERTScore is consistently a top performer. It also gives a large improvement over the commonly used BLEU metric. In to-English translation, RUSE shows competitive results, but it is a learned metric and requires costly human judgments as supervision.<br />
<br />
=== Image Captioning ===<br />
For Image Captioning, human judgments for twelve submission entries from the COCO 2015 Captioning Challenge are used. Following Cui et al. (2018) <sup> [3] </sup>, the Pearson correlation with two system-level metrics is calculated: the percentage of captions that are better than or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). Each image has approximately five reference captions, and the BERTScore is taken to be the maximum of the scores computed against each reference caption individually. BERTScore is compared with eight task-agnostic metrics (shown under the Metric column in Table 5) and two task-specific metrics, Semantic Propositional Image Caption Evaluation (SPICE) <sup> [8] </sup> and Learning to Evaluate Image Caption (LEIC) <sup> [3] </sup>. Given an input image, LEIC predicts whether a caption is written by a human, whereas SPICE compares scene graphs parsed from the reference and candidate captions.<br />
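<br />
The multi-reference rule can be sketched as taking the maximum of a pairwise score over all references. In the snippet below, token_overlap is a hypothetical stand-in for BERTScore so that the example stays self-contained, and the captions are invented for illustration.<br />

```python
def best_over_references(candidate, references, score_fn):
    """Score a candidate against every reference and keep the best score."""
    return max(score_fn(reference, candidate) for reference in references)


def token_overlap(reference, candidate):
    """Stand-in score: fraction of candidate tokens that appear in the reference."""
    ref_tokens = set(reference.split())
    cand_tokens = candidate.split()
    return sum(t in ref_tokens for t in cand_tokens) / len(cand_tokens)


references = [
    "a dog is running",
    "a dog runs on the grass",
    "the puppy plays outside",
]
candidate = "a dog runs on grass"
print(best_over_references(candidate, references, token_overlap))  # 1.0
```

Taking the maximum means a candidate is not penalized for differing from some references as long as it matches at least one of them closely.<br />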
<br />
<div align="center"> [[File:Table5_BERTScore.PNG|450px| Table5 Image Captioning]] </div><br />
<br />
<div align="center"> '''Table 5: Pearson correlation on the 2015 COCO Captioning Challenge.''' </div><br />
<br />
BERTScore is again a top performer, while n-gram metrics like BLEU show a weak correlation with human judgments. For this task, importance weighting yields a significant improvement, which highlights the importance of content words. <br />
<br />
'''Speed:''' Computing BERTScore takes longer than computing BLEU, but the cost remains small in absolute terms. For example, on the same hardware, the Machine Translation test takes 15.6 seconds with BERTScore compared to 5.4 seconds with BLEU, so the difference is marginal in practice.<br />
<br />
== Robustness Analysis ==<br />
The authors tested BERTScore's robustness using two adversarial paraphrase classification datasets, QQP and PAWS. The table below summarizes the results. Most metrics perform well on QQP, but their performance drops significantly on PAWS. In contrast, BERTScore performs competitively on PAWS, which suggests that it is better at distinguishing harder adversarial examples.<br />
<br />
<div align="center"> [[File: bertscore.png | 500px]] </div><br />
<br />
== Source Code == <br />
The code for this paper is available at [https://github.com/Tiiiger/bert_score BERTScore].<br />
<br />
== Critique & Future Prospects==<br />
The paper proposes a text evaluation metric, BERTScore, which outperforms previous approaches because it uses contextual embeddings for evaluation. It is simpler, easier to use, and more robust than previous approaches, as shown by the experiments carried out on datasets consisting of paraphrased sentences. There are several variants of BERTScore, depending on the contextual embedding model, the use of importance weighting, and the evaluation metric (Precision, Recall, or F1 score). <br />
<br />
The main reason behind the success of BERTScore is the use of contextual embeddings. The remaining architecture is straightforward in itself. There are some word embedding models that use complex metrics for calculating similarity. If we try to use those models along with contextual embeddings instead of word embeddings, they might result in more reliable performance than the BERTScore.<br />
<br />
BERT can also be used for other Natural Language Processing tasks such as text classification and named entity recognition (NER). In the NER task, the IOB tagging scheme was applied to the prediction model. The model and tagging scheme can be found in the spaCy package, and a performance metric called through Keras is sufficient to evaluate the model. BERTScore does have some drawbacks, including higher memory consumption and higher time complexity than its predecessor BLEU.<br />
<br />
The paper is interesting, but the proposed approach offers limited technical novelty. The method is a natural application of BERT combined with traditional cosine similarity, precision, recall, and F1 computations, and simple idf-based importance weighting. In the future, the authors should consider extending the metric to pairs of languages in which the words are not directly comparable. The metric should also be able to distinguish a bad output from the worst one and to clearly identify the best output among the available options.<br />
<br />
== References ==<br />
<br />
[1] Evan James Williams. Regression analysis. Wiley, 1959.<br />
<br />
[2] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In EMNLP, 2014.<br />
<br />
[3] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. Learning to evaluate image captioning. In CVPR, 2018.<br />
<br />
[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.<br />
<br />
[5] Qingsong Ma, Ondrej Bojar, and Yvette Graham. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In WMT, 2018.<br />
<br />
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.<br />
<br />
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019.<br />
<br />
[8] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, 2016.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT&diff=49851BERTScore: Evaluating Text Generation with BERT2020-12-10T03:19:59Z<p>A2chanan: /* Critique & Future Prospects */</p>
<hr />
<div>== Presented by == <br />
Gursimran Singh<br />
<br />
== Introduction == <br />
In recent times, various machine learning approaches for text generation have gained popularity. This paper aims to develop an automatic metric that will judge the quality of the generated text. Commonly used state of the art metrics either uses n-gram approaches or word embeddings for calculating the similarity between the reference and the candidate sentence. BertScore, on the other hand, calculates the similarity using contextual embeddings. BertScore basically addresses two common pitfalls in n-gram-based metrics. Firstly, the n-gram models fail to robustly match paraphrases which leads to performance underestimation when semantically-correct phrases are penalized because of their difference from the surface form of the reference. On the other hand in BertScore, the similarity is computed using contextualized token embeddings, which have been shown to be effective for paraphrase detection. Secondly, n-gram models fail to capture distant dependencies and penalize semantically-critical ordering changes. In contrast, contextualized embeddings capture distant dependencies and ordering effectively. Finally, the BERTScore is a task-independent evaluation metric which makes it a better choice in comparison to other state of art models. The authors of the paper have carried out various experiments in Machine Translation and Image Captioning to show why BertScore is more reliable and robust than the previous approaches.<br />
<br />
''' Word versus Context Embeddings '''<br />
<br />
Both models aim to reduce the sparseness invoked by a bag of words (BoW) representation of text due to the high dimensional vocabularies. Both methods create embeddings of a dimensionality much lower than sparse BoW and aim to capture semantics and context. Word embeddings differ in that they will be deterministic as when given a word embedding model will always produce the same embedding, regardless of the surrounding words. However, contextual embeddings will create different embeddings for a word depending on the surrounding words in the given text.<br />
<br />
== Previous Work ==<br />
Previous Approaches for evaluating text generation can be broadly divided into various categories. The commonly used techniques for text evaluation are based on n-Gram matching. The main objective here is to compare the n-grams in reference and candidate sentences and thus analyze the ordering of words in the sentences. Most of the methods utilize or slightly modify the exact match precision (Exact-<math>P_n</math>) and recall (Exact-<math>R_n</math>) scores. These scores can be formalized as follows:<br />
<br />
<div align="center">Exact- <math> P_n = \frac{\sum_{w \ in S^{n}_{ \hat{x} }} \mathbb{I}[w \in S^{n}_{x}]}{S^{n}_{\hat{x}}} </math> </div><br />
<br />
<div align="center">Exact- <math> R_n = \frac{\sum_{w \ in S^{n}_{x}} \mathbb{I}[w \in S^{n}_{\hat{x}}]}{S^{n}_{x}} </math> </div><br />
<br />
Here <math>I[.]</math> is an indicator function, <math>S^{n}_{x}</math> and <math>S^{n}_{\hat{x}}</math> are lists of token <math>n</math>-grams in the reference <math>x</math> and candidate <math>\hat{x}</math> sentences respectively.<br />
<br />
The most popular n-Gram Matching metric is BLEU (Bilingual Evaluation Understudy). The output for this metric is between 0.0 and 1.0 where a score of 0.0 denotes a perfect mismatch and a score of 1.0 denotes a perfect match between candidate sentence and reference sentence. It follows the underlying principle of n-Gram matching and made the following three modifications to Exact-<math>P_n</math> method: <br><br />
• Each n-Gram is matched at most once. <br><br />
• The total of exact-matches is accumulated for all reference candidate pairs and divided by the total number of <math>n</math>-grams in all candidate sentences. <br><br />
• Very short candidates are restricted. <br><br />
<br />
Further BLEU is generally calculated for multiple <math>n</math>-grams and averaged geometrically.<br />
n-Gram approaches also include METEOR, NIST, ΔBLEU, etc.<br />
METEOR (Banerjee & Lavie, 2005) computes Exact- <math> P_1 </math> and Exact- <math> R_1 </math> with the modification that when the exact unigram matching is not possible, matching to word stems, synonyms, and paraphrases are used instead. For example, ''running'' may be matched with ''run'' if no exact match was found. This non-exact matching is done using external tools such as a paraphrase table. In newer versions of METEOR, an external paraphrase resource is used and different weights are assigned to different matching types. <br />
<br />
Other categories include Edit-distance-based Metrics which compare two strings by calculating the minimum operations to transform one into the other, Embedding-based metrics which are derive based on an applied embedding space to the strings, and Learned Metrics which construct task specific-metrics using a machine learning approach on a supervised data set. Most of these techniques do not capture the context of a word in the sentence. Moreover, Learned Metric approaches also require costly human judgements as supervision for each datasets.<br />
<br />
== Motivation ==<br />
The <math>n</math>-gram approaches like BLEU do not capture the positioning and the context of the word and simply rely on exact matching for evaluation. Consider the following example that shows how BLEU can result in incorrect judgment. <br><br />
Reference: people like foreign cars <br><br />
Candidate 1: people like visiting places abroad <br><br />
Candidate 2: consumers prefer imported cars<br />
<br />
BLEU gives a higher score to Candidate 1 as compared to Candidate 2. This undermines the performance of text generation models since contextually correct sentences are penalized. In contrast, some semantically different phrases are scored higher just because they are closer to the surface form of the reference sentence. <br />
<br />
On the other hand, BERTScore computes similarity using contextual token embeddings. It helps in detecting semantically correct paraphrased sentences. It also captures the cause and effect relationship (A gives B in place of B gives A) that the BLEU score isn't detected.<br />
<br />
== BERTScore Architecture ==<br />
Fig 1 summarizes the steps for calculating the BERTScore. Next, we will see details about each step. Here, the reference sentence is given by <math> x = ⟨x1, . . . , xk⟩ </math> and candidate sentence <math> \hat{x} = ⟨\hat{x1}, . . . , \hat{xl}⟩. </math> <br><br />
<br />
<div align="center"> [[File:Architecture_BERTScore.PNG|Illustration of the computation of BERTScore.]] </div><br />
<div align="center">'''Fig 1'''</div><br />
<br />
=== Token Representation ===<br />
Reference and candidate sentences are represented using contextual embeddings. Word embedding techniques inspire this but in contrast to word embeddings, the contextual embedding of a word depends upon the surrounding words in the sentence. These contextual embeddings are calculated using BERT, Roberta, XLNET, and XLM models, which utilize self-attention and nonlinear transformations.<br />
<br />
<div align="center"> [[File:Pearsson_corr_contextual_emb.PNG|Pearson Correlation for Contextual Embedding]] </div><br />
<div align="center">'''Fig 2'''</div><br />
<br />
=== Cosine Similarity ===<br />
Pairwise cosine similarity is calculated between each token <math> x_{i} </math> in reference sentence and <math> \hat{x}_{j} </math> in candidate sentence. Prenormalized vectors are used, therefore the pairwise similarity is given by <math> x_{i}^T \hat{x_{i}}. </math><br />
<br />
=== BERTScore ===<br />
<br />
Each token in x is matched to the most similar token in <math> \hat{x} </math> and vice-versa for calculating Recall and Precision respectively. The matching is greedy and isolated. Precision and Recall are combined for calculating the F1 score. The equations for calculating Precision, Recall, and F1 Score are as follows<br />
<br />
<div align="center"> [[File:Equations.PNG|Equations for the calculation of BERTScore.]] </div><br />
<br />
<br />
=== Importance Weighting (optional) ===<br />
In some cases, rare words can be highly indicative of sentence similarity. Therefore, Inverse Document Frequency (idf) can be used with the above equations of the BERTScore. This is optional and depending on the domain of the text and the available data it may or may not benefit the final results. Thus understanding more about Importance Weighing is an open area of research.<br />
<br />
=== Baseline Rescaling ===<br />
Rescaling is done only to increase the human readability of the score. In theory, cosine similarity values are between -1 and 1 but practically they are confined in a much smaller range. A value b computed using Common Crawl monolingual datasets is used to linearly rescale the BERTScore. The rescaled recall <math> \hat{R}_{BERT} </math> is given by<br />
<div align="center"> [[File:Equation2.PNG|Equation for the rescaled BERTScore.]] </div><br />
Similarly, <math> P_{BERT} </math> and <math> F_{BERT} </math> are rescaled as well.<br />
<br />
=== Experiment & Results ===<br />
The authors have experimented with different pre-trained contextual embedding models like BERT, RoBERTa, etc, and reported the best performing model results. In addition to the standard evaluation, they have also designed model selection experiments. They used 10K hybrid systems super-sampled from WMT18. They randomly select 100 out of 10K hybrid systems and rank them using the automatic metrics. The evaluation has been done on Machine Translation and Image Captioning tasks.<br />
<br />
=== Machine Translation ===<br />
The metric evaluation dataset consists of 149 translation systems, gold references, and two types of human judgments, namely, Segment-level human judgments and System-level human judgments. The former assigns a score to each reference candidate pair and the latter associates a single score for the whole system. Segment-level outputs for BERTScore are calculated as explained in the previous section on architecture and the System-level outputs are calculated by taking an average of BERTScore for every reference-candidate pair. Absolute Pearson Correlation <math> \lvert \rho \rvert </math> and Kendall rank correlation <math> \tau </math> are used for calculating metric quality, Williams test <sup> [1] </sup> for significance of <math> \lvert \rho \rvert </math> and Graham & Baldwin <sup> [2] </sup> methods for calculating the bootstrap resampling of <math> \tau </math>. The authors have also created hybrid systems by randomly sampling one candidate sentence for each reference sentence from one of the systems. This increases the volume of systems for System-level experiments. Further, the authors have also randomly selected 100 systems out of 10k hybrid systems for ranking them using automatic metrics. They have repeated this process multiple times and generated Hits@1, which contains the percentage of the metric ranking agreeing with human ranking on the best system. <br />
<br />
<div align="center"> '''The following 4 tables show the result of the experiments mentioned above.''' </div> <br><br />
<br />
<div align="center"> [[File:Table1_BERTScore.PNG|700px| Table1 Machine Translation]] [[File:Table2_BERTScore.PNG|700px| Table2 Machine Translation]] </div><br />
<div align="center"> [[File:Table3_BERTScore.PNG|700px| Table3 Machine Translation]] [[File:Table4_BERTScore.PNG|700px| Table4 Machine Translation]] </div><br />
<br />
In all 4 tables, we can see that BERTScore is consistently a top performer. It also gives a large improvement over the current state-of-the-art BLEU score. In to-English translation, RUSE shows competitive results but it is a learned metric technique and requires costly human judgments as supervision.<br />
<br />
=== Image Captioning ===<br />
For Image Captioning, human judgment for twelve submission entries from the COCO 2015 Captioning Challenge is used. As per Cui et al. (2018) <sup> [3] </sup>, Pearson Correlation with two System-Level metrics is calculated. The metrics used in the results are the percentage of captions better than or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). There are approximately five reference captions and the BERTScore is taken to be the maximum of all the BERTScores individually with each reference caption. BERTScore is compared with eight task-agnostic metrics (shown under the Metric column in Table 5) and two task-specific metrics, Semantic Propositional Image Caption Evaluation (SPICE) [8] and Learning to Evaluate Image Caption (LEIC) [3]. Given an input image, LEIC predicts whether a caption is written by a human whereas SPICE makes use of scene graphs parsed from reference and candidate captions to compare the similarity.<br />
<br />
<div align="center"> [[File:Table5_BERTScore.PNG|450px| Table5 Image Captioning]] </div><br />
<br />
<div align="center"> '''Table 5: Pearson correlation on the 2015 COCO Captioning Challenge.''' </div><br />
<br />
BERTScore is again a top performer and n-gram metrics like BLEU show a weak correlation with human judgments. For this task, importance weighting shows significant improvement depicting the importance of content words. <br />
<br />
'''Speed:''' The time taken for calculating BERTScore is not significantly higher than BLEU. For example, with the same hardware, the Machine Translation test on BERTScore takes 15.6 secs compared to 5.4 secs for BLEU. The time range is essentially small and thus the difference is marginal.<br />
<br />
== Robustness Analysis ==<br />
The authors tested BERTScore's robustness using two adversarial paraphrase classification datasets, QQP, and PAWS. The table below summarized the result. Most metrics have a good performance on QQP, but their performance drops significantly on PAWS. Conversely, BERTScore performs competitively on PAWS, which suggests BERTScore is better at distinguishing harder adversarial examples.<br />
<br />
<div align="center"> [[File: bertscore.png | 500px]] </div><br />
<br />
== Source Code == <br />
The code for this paper is available at [https://github.com/Tiiiger/bert_score BERTScore].<br />
<br />
== Critique & Future Prospects==<br />
A text evaluation metric, BERTScore, is proposed and outperforms the previous approaches because of its capacity to use contextual embeddings for evaluation. It is simpler, easier to use and more robust than previous approaches. This is shown by the experiments carried on the datasets consisting of paraphrased sentences. There are variants of BERTScore depending upon the contextual embedding model, use of importance weighting, and the evaluation metric (Precision, Recall, or F1 score). <br />
<br />
The main reason behind the success of BERTScore is the use of contextual embeddings. The remaining architecture is straightforward in itself. There are some word embedding models that use complex metrics for calculating similarity. If we try to use those models along with contextual embeddings instead of word embeddings, they might result in more reliable performance than the BERTScore.<br />
<br />
BERT can also be used for other Natural Language Processing tasks like text classification, NER and etc. In the NER task, the IOB-NER tagging system was applied to the prediction model. The model and taking system could be found in the SpaCy package and then a performance metrics called through Keras will be efficient enough to evaluate the model.<br />
<br />
The paper was quite interesting, but it is obvious that they lack technical novelty in their proposed approach. Their method is a natural application of BERT along with traditional cosine similarity measures and precision, recall, F1-based computations, and simple IDF-based importance weighting. In the future, the authors should consider scaling the model for a pair of languages where the words are not directly comparable. Also, the model should be able to compare between a bad and the worst output and clearly classify the best output from the available options.<br />
<br />
== References ==<br />
<br />
[1] Evan James Williams. Regression analysis. wiley, 1959.<br />
<br />
[2] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In EMNLP, 2014.<br />
<br />
[3] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. Learning to evaluate image captioning. In CVPR, 2018.<br />
<br />
[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.<br />
<br />
[5] Qingsong Ma, Ondrej Bojar, and Yvette Graham. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In WMT, 2018.<br />
<br />
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.<br />
<br />
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019.<br />
<br />
[8] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, 2016.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT&diff=49850BERTScore: Evaluating Text Generation with BERT2020-12-10T03:13:19Z<p>A2chanan: /* Machine Translation */</p>
<hr />
<div>== Presented by == <br />
Gursimran Singh<br />
<br />
== Introduction == <br />
In recent times, various machine learning approaches for text generation have gained popularity. This paper aims to develop an automatic metric that judges the quality of generated text. Commonly used metrics rely on either n-gram matching or word embeddings to calculate the similarity between the reference and the candidate sentence. BERTScore, on the other hand, calculates the similarity using contextual embeddings. BERTScore addresses two common pitfalls of n-gram-based metrics. Firstly, n-gram models fail to robustly match paraphrases, which leads to performance underestimation when semantically correct phrases are penalized because they differ from the surface form of the reference. In BERTScore, the similarity is instead computed using contextualized token embeddings, which have been shown to be effective for paraphrase detection. Secondly, n-gram models fail to capture distant dependencies and to penalize semantically critical ordering changes. In contrast, contextualized embeddings capture distant dependencies and ordering effectively. Finally, BERTScore is a task-independent evaluation metric, which makes it more widely applicable than task-specific metrics. The authors carried out experiments in Machine Translation and Image Captioning to show that BERTScore is more reliable and robust than the previous approaches.<br />
<br />
''' Word versus Context Embeddings '''<br />
<br />
Both word embeddings and contextual embeddings aim to reduce the sparseness of a bag-of-words (BoW) representation of text, which arises from high-dimensional vocabularies. Both create embeddings of a dimensionality much lower than sparse BoW and aim to capture semantics and context. Word embeddings are static: a given word embedding model always produces the same embedding for a word, regardless of the surrounding words. Contextual embeddings, in contrast, produce different embeddings for a word depending on the surrounding words in the given text.<br />
<br />
== Previous Work ==<br />
Previous approaches for evaluating text generation can be broadly divided into several categories. The most commonly used techniques are based on n-gram matching. The main objective here is to compare the n-grams in the reference and candidate sentences and thus analyze the ordering of words in the sentences. Most of these methods use, or slightly modify, the exact match precision (Exact-<math>P_n</math>) and recall (Exact-<math>R_n</math>) scores. These scores can be formalized as follows:<br />
<br />
<div align="center">Exact- <math> P_n = \frac{\sum_{w \in S^{n}_{\hat{x}}} \mathbb{I}[w \in S^{n}_{x}]}{|S^{n}_{\hat{x}}|} </math> </div><br />
<br />
<div align="center">Exact- <math> R_n = \frac{\sum_{w \in S^{n}_{x}} \mathbb{I}[w \in S^{n}_{\hat{x}}]}{|S^{n}_{x}|} </math> </div><br />
<br />
Here <math>\mathbb{I}[\cdot]</math> is an indicator function, and <math>S^{n}_{x}</math> and <math>S^{n}_{\hat{x}}</math> are the lists of token <math>n</math>-grams in the reference sentence <math>x</math> and the candidate sentence <math>\hat{x}</math>, respectively.<br />
<br />
The most popular n-gram matching metric is BLEU (Bilingual Evaluation Understudy). Its output lies between 0.0 and 1.0, where a score of 0.0 denotes a complete mismatch and a score of 1.0 denotes a perfect match between the candidate sentence and the reference sentence. It follows the underlying principle of n-gram matching and makes the following three modifications to Exact-<math>P_n</math>: <br><br />
• Each n-gram is matched at most once. <br><br />
• The total number of exact matches is accumulated over all reference-candidate pairs and divided by the total number of <math>n</math>-grams in all candidate sentences. <br><br />
• Very short candidates are penalized through a brevity penalty. <br><br />
<br />
Further, BLEU is generally calculated for multiple values of <math>n</math> and averaged geometrically.<br />
Other n-gram approaches include METEOR, NIST, ΔBLEU, etc.<br />
METEOR (Banerjee & Lavie, 2005) computes Exact- <math> P_1 </math> and Exact- <math> R_1 </math> with the modification that, when exact unigram matching is not possible, matching to word stems, synonyms, and paraphrases is used instead. For example, ''running'' may be matched with ''run'' if no exact match is found. This non-exact matching relies on external resources such as a stemmer, a synonym lexicon, and a paraphrase table. In newer versions of METEOR, an external paraphrase resource is used and different weights are assigned to different matching types. <br />
<br />
Other categories include edit-distance-based metrics, which compare two strings by calculating the minimum number of operations needed to transform one into the other; embedding-based metrics, which are derived by mapping the strings into an embedding space; and learned metrics, which construct task-specific metrics using a machine learning approach on a supervised dataset. Most of these techniques do not capture the context of a word in the sentence. Moreover, learned metrics require costly human judgments as supervision for each dataset.<br />
<br />
== Motivation ==<br />
The <math>n</math>-gram approaches like BLEU do not capture the positioning and the context of the word and simply rely on exact matching for evaluation. Consider the following example that shows how BLEU can result in incorrect judgment. <br><br />
Reference: people like foreign cars <br><br />
Candidate 1: people like visiting places abroad <br><br />
Candidate 2: consumers prefer imported cars<br />
<br />
BLEU gives a higher score to Candidate 1 than to Candidate 2, even though Candidate 2 is the better paraphrase of the reference. This leads to underestimating the performance of text generation models, since semantically correct sentences are penalized while semantically different phrases are scored higher just because they are closer to the surface form of the reference sentence. <br />
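
The ranking in this example can be reproduced with a simplified unigram version of the Exact-<math>P_n</math> and Exact-<math>R_n</math> scores from the Previous Work section. The sketch below is only an illustration: it is not the full BLEU computation (no clipping, brevity penalty, or geometric averaging), whitespace tokenization is an assumption, and the helper names <code>ngrams</code> and <code>exact_match_scores</code> are hypothetical.

```python
def ngrams(tokens, n):
    """Return the list of token n-grams of a tokenized sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def exact_match_scores(reference, candidate, n=1):
    """Exact-P_n and Exact-R_n using whitespace tokenization (an assumption)."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    ref_set, cand_set = set(ref), set(cand)
    # Precision: fraction of candidate n-grams that also occur in the reference.
    precision = sum(w in ref_set for w in cand) / len(cand) if cand else 0.0
    # Recall: fraction of reference n-grams that also occur in the candidate.
    recall = sum(w in cand_set for w in ref) / len(ref) if ref else 0.0
    return precision, recall


reference = "people like foreign cars"
p1, r1 = exact_match_scores(reference, "people like visiting places abroad")
p2, r2 = exact_match_scores(reference, "consumers prefer imported cars")
print(p1, r1)  # 0.4 0.5
print(p2, r2)  # 0.25 0.25
```

Candidate 1 shares the surface tokens "people" and "like" with the reference and therefore scores higher than Candidate 2, which only shares "cars" despite being the closer paraphrase.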
<br />
On the other hand, BERTScore computes similarity using contextual token embeddings. This helps in detecting semantically correct paraphrases. It also captures changes in cause-and-effect ordering (A gives B as opposed to B gives A) that BLEU fails to detect.<br />
<br />
== BERTScore Architecture ==<br />
Fig 1 summarizes the steps for calculating the BERTScore. Next, we will see details about each step. Here, the reference sentence is given by <math> x = \langle x_1, \ldots, x_k \rangle </math> and the candidate sentence by <math> \hat{x} = \langle \hat{x}_1, \ldots, \hat{x}_l \rangle. </math> <br><br />
<br />
<div align="center"> [[File:Architecture_BERTScore.PNG|Illustration of the computation of BERTScore.]] </div><br />
<div align="center">'''Fig 1'''</div><br />
<br />
=== Token Representation ===<br />
Reference and candidate sentences are represented using contextual embeddings. This is inspired by word embedding techniques, but in contrast to word embeddings, the contextual embedding of a word depends on the surrounding words in the sentence. These contextual embeddings are calculated using models such as BERT, RoBERTa, XLNet, and XLM, which utilize self-attention and nonlinear transformations.<br />
<br />
<div align="center"> [[File:Pearsson_corr_contextual_emb.PNG|Pearson Correlation for Contextual Embedding]] </div><br />
<div align="center">'''Fig 2'''</div><br />
<br />
=== Cosine Similarity ===<br />
Pairwise cosine similarity is calculated between each token <math> x_{i} </math> in the reference sentence and each token <math> \hat{x}_{j} </math> in the candidate sentence. Pre-normalized vectors are used, so the pairwise similarity reduces to the inner product <math> x_{i}^\top \hat{x}_{j}. </math><br />
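
As a minimal sketch of this step, the snippet below normalizes each token vector to unit length and then takes inner products to obtain the pairwise similarity matrix. The two-dimensional vectors are made-up toy values, not real BERT embeddings, and the function names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so that inner products equal cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def similarity_matrix(ref_vecs, cand_vecs):
    """Entry [i][j] is the cosine similarity between reference token i and candidate token j."""
    ref = [normalize(v) for v in ref_vecs]
    cand = [normalize(v) for v in cand_vecs]
    return [[sum(a * b for a, b in zip(x, y)) for y in cand] for x in ref]


# Toy embeddings (assumed values for illustration only).
ref_vecs = [[1.0, 0.0], [0.0, 2.0]]
cand_vecs = [[3.0, 0.0], [1.0, 1.0]]
sim = similarity_matrix(ref_vecs, cand_vecs)
for row in sim:
    print([round(s, 3) for s in row])
```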
<br />
=== BERTScore ===<br />
<br />
Each token in <math> x </math> is matched to the most similar token in <math> \hat{x} </math> to calculate Recall, and each token in <math> \hat{x} </math> is matched to the most similar token in <math> x </math> to calculate Precision. The matching is greedy: every token is matched in isolation, independently of how the other tokens are matched. Precision and Recall are then combined into the F1 score. The equations for calculating Precision, Recall, and F1 score are as follows:<br />
<br />
<div align="center"> [[File:Equations.PNG|Equations for the calculation of BERTScore.]] </div><br />
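
The greedy matching can be sketched as follows, assuming a precomputed similarity matrix whose rows correspond to reference tokens and whose columns correspond to candidate tokens. The matrix values are invented for illustration and the function name is hypothetical; this is not the authors' implementation.

```python
def bertscore_from_similarity(sim):
    """Greedy-matching precision, recall, and F1 from a similarity matrix.

    sim[i][j] is the similarity between reference token i and candidate token j.
    """
    n_ref, n_cand = len(sim), len(sim[0])
    # Recall: each reference token takes its best-matching candidate token.
    recall = sum(max(row) for row in sim) / n_ref
    # Precision: each candidate token takes its best-matching reference token.
    precision = sum(max(row[j] for row in sim) for j in range(n_cand)) / n_cand
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom != 0 else 0.0
    return precision, recall, f1


# Two reference tokens, three candidate tokens (assumed toy similarities).
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.4],
]
precision, recall, f1 = bertscore_from_similarity(toy_sim)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Because every token picks its maximum independently, the same token on the other side may be chosen more than once, which is what makes the matching greedy rather than one-to-one.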
<br />
<br />
=== Importance Weighting (optional) ===<br />
In some cases, rare words can be highly indicative of sentence similarity. Therefore, inverse document frequency (idf) weights can be incorporated into the above BERTScore equations. This is optional: depending on the domain of the text and the available data, it may or may not benefit the final results. Understanding importance weighting better therefore remains an open area of research.<br />
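
A minimal sketch of idf-weighted recall is shown below. It assumes idf scores are estimated from a small corpus of tokenized reference sentences with plus-one smoothing; the smoothing choice, the tiny corpus, the similarity values, and the function names are all assumptions made for illustration and may differ from the paper's exact formulation.

```python
import math


def idf_weights(corpus):
    """Estimate idf weights from a list of tokenized sentences (plus-one smoothing assumed)."""
    m = len(corpus)
    doc_freq = {}
    for sentence in corpus:
        for token in set(sentence):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    weights = {tok: math.log((m + 1) / (df + 1)) for tok, df in doc_freq.items()}
    unseen = math.log(m + 1)  # weight for tokens that never occur in the corpus
    return weights, unseen


def weighted_recall(sim, ref_tokens, weights, unseen):
    """Recall where each reference token's best match is weighted by its idf."""
    w = [weights.get(tok, unseen) for tok in ref_tokens]
    total = sum(w)
    if total == 0:  # every token occurs in every sentence: fall back to uniform weights
        w, total = [1.0] * len(w), float(len(w))
    return sum(wi * max(row) for wi, row in zip(w, sim)) / total


corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
weights, unseen = idf_weights(corpus)

ref_tokens = ["the", "dog"]
toy_sim = [[0.2], [0.9]]  # one candidate token; assumed similarities
r_idf = weighted_recall(toy_sim, ref_tokens, weights, unseen)
r_plain = sum(max(row) for row in toy_sim) / len(toy_sim)
print(round(r_idf, 4), round(r_plain, 4))
```

In this toy case the frequent token "the" receives zero weight, so the weighted recall is driven entirely by the rarer token "dog".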
<br />
=== Baseline Rescaling ===<br />
Rescaling is done only to increase the human readability of the score. In theory, cosine similarity values lie between -1 and 1, but in practice they are confined to a much smaller range. A baseline value <math> b </math>, computed using Common Crawl monolingual datasets, is used to linearly rescale the BERTScore. The rescaled recall <math> \hat{R}_{BERT} </math> is given by<br />
<div align="center"> [[File:Equation2.PNG|Equation for the rescaled BERTScore.]] </div><br />
Similarly, <math> P_{BERT} </math> and <math> F_{BERT} </math> are rescaled as well.<br />
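
The linear rescaling can be illustrated with a short sketch, assuming the form (score - b) / (1 - b), which maps the baseline to 0 and a perfect score to 1. The baseline value 0.85 used here is a made-up number for illustration, not one reported by the authors.

```python
def rescale(score, baseline):
    """Linearly rescale a BERTScore value so that `baseline` maps to 0 and 1 maps to 1."""
    return (score - baseline) / (1.0 - baseline)


b = 0.85  # hypothetical baseline value
raw_scores = [0.85, 0.92, 1.0]
rescaled = [rescale(s, b) for s in raw_scores]
print([round(s, 4) for s in rescaled])
```

The rescaling is monotonic, so it changes neither the ranking of candidates nor the correlation with human judgments; it only spreads the scores over a more readable range.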
<br />
=== Experiment & Results ===<br />
The authors have experimented with different pre-trained contextual embedding models like BERT, RoBERTa, etc, and reported the best performing model results. In addition to the standard evaluation, they have also designed model selection experiments. They used 10K hybrid systems super-sampled from WMT18. They randomly select 100 out of 10K hybrid systems and rank them using the automatic metrics. The evaluation has been done on Machine Translation and Image Captioning tasks.<br />
<br />
=== Machine Translation ===<br />
The metric evaluation dataset consists of 149 translation systems, gold references, and two types of human judgments, namely segment-level and system-level human judgments. The former assigns a score to each reference-candidate pair and the latter associates a single score with the whole system. Segment-level outputs for BERTScore are calculated as explained in the previous section on architecture, and system-level outputs are calculated by averaging the BERTScore over every reference-candidate pair. Absolute Pearson correlation <math> \lvert \rho \rvert </math> and Kendall rank correlation <math> \tau </math> are used to measure metric quality, with the Williams test <sup> [1] </sup> for the significance of <math> \lvert \rho \rvert </math> and the bootstrap re-sampling method of Graham & Baldwin <sup> [2] </sup> for the significance of <math> \tau </math>. The authors have also created hybrid systems by randomly sampling, for each reference sentence, one candidate sentence from one of the systems. This increases the number of systems available for system-level experiments. Further, the authors randomly selected 100 systems out of the 10K hybrid systems and ranked them using the automatic metrics. They repeated this process multiple times and reported Hits@1, the percentage of trials in which the metric ranking agrees with the human ranking on the best system. <br />
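
The system-level averaging and the Hits@1 model selection procedure can be sketched as below. The system names, scores, sample size, and number of trials are invented for illustration, and the function names are hypothetical; the sketch only mirrors the procedure described in the text, not the authors' evaluation code.

```python
import random


def system_score(segment_scores):
    """System-level score: the average of the segment-level scores."""
    return sum(segment_scores) / len(segment_scores)


def hits_at_1(metric_scores, human_scores, sample_size, trials, seed=0):
    """Fraction of trials in which the metric and humans agree on the best sampled system."""
    rng = random.Random(seed)
    systems = list(metric_scores)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(systems, sample_size)
        best_by_metric = max(sample, key=lambda s: metric_scores[s])
        best_by_human = max(sample, key=lambda s: human_scores[s])
        hits += best_by_metric == best_by_human
    return hits / trials


# Hypothetical segment-level metric scores for three systems.
segments = {"sysA": [0.70, 0.72], "sysB": [0.80, 0.84], "sysC": [0.90, 0.94]}
metric_scores = {name: system_score(s) for name, s in segments.items()}
human_scores = {"sysA": 1.0, "sysB": 2.0, "sysC": 3.0}  # hypothetical human ratings

agreement = hits_at_1(metric_scores, human_scores, sample_size=2, trials=50)
print(agreement)
```

Here the metric and the human ratings rank the three systems identically, so they agree on the best system in every trial.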
<br />
<div align="center"> '''The following 4 tables show the result of the experiments mentioned above.''' </div> <br><br />
<br />
<div align="center"> [[File:Table1_BERTScore.PNG|700px| Table1 Machine Translation]] [[File:Table2_BERTScore.PNG|700px| Table2 Machine Translation]] </div><br />
<div align="center"> [[File:Table3_BERTScore.PNG|700px| Table3 Machine Translation]] [[File:Table4_BERTScore.PNG|700px| Table4 Machine Translation]] </div><br />
<br />
In all 4 tables, we can see that BERTScore is consistently a top performer. It also gives a large improvement over the widely used BLEU score. In to-English translation, RUSE shows competitive results, but it is a learned metric and requires costly human judgments as supervision.<br />
<br />
=== Image Captioning ===<br />
For Image Captioning, human judgment for twelve submission entries from the COCO 2015 Captioning Challenge is used. As per Cui et al. (2018) <sup> [3] </sup>, Pearson Correlation with two System-Level metrics is calculated. The metrics used in the results are the percentage of captions better than or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). There are approximately five reference captions and the BERTScore is taken to be the maximum of all the BERTScores individually with each reference caption. BERTScore is compared with eight task-agnostic metrics (shown under the Metric column in Table 5) and two task-specific metrics, Semantic Propositional Image Caption Evaluation (SPICE) [8] and Learning to Evaluate Image Caption (LEIC) [3]. Given an input image, LEIC predicts whether a caption is written by a human whereas SPICE makes use of scene graphs parsed from reference and candidate captions to compare the similarity.<br />
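
Two pieces of this evaluation are easy to sketch: taking the maximum score over several reference captions, and computing the Pearson correlation between system-level metric scores and human judgments. All numbers below are invented for illustration and the function names are hypothetical.

```python
import math


def multi_reference_score(per_reference_scores):
    """Score of a candidate against several references: the maximum over references."""
    return max(per_reference_scores)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# One candidate caption scored against five reference captions (assumed scores).
caption_score = multi_reference_score([0.71, 0.88, 0.64, 0.80, 0.77])

# Hypothetical system-level metric scores and human judgments for four systems.
metric = [0.60, 0.70, 0.80, 0.90]
human = [0.10, 0.30, 0.50, 0.70]
rho = pearson(metric, human)
print(caption_score, round(rho, 4))
```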
<br />
<div align="center"> [[File:Table5_BERTScore.PNG|450px| Table5 Image Captioning]] </div><br />
<br />
<div align="center"> '''Table 5: Pearson correlation on the 2015 COCO Captioning Challenge.''' </div><br />
<br />
BERTScore is again a top performer, while n-gram metrics like BLEU show a weak correlation with human judgments. For this task, importance weighting yields a significant improvement, which indicates the importance of content words. <br />
<br />
'''Speed:''' BERTScore is slower to compute than BLEU, but the difference is small in absolute terms. For example, on the same hardware, the Machine Translation test takes 15.6 seconds with BERTScore compared to 5.4 seconds with BLEU. Both run times are short, so the extra cost is marginal in practice.<br />
<br />
== Robustness Analysis ==<br />
The authors tested BERTScore's robustness using two adversarial paraphrase classification datasets, QQP and PAWS. The table below summarizes the results. Most metrics perform well on QQP, but their performance drops significantly on PAWS. In contrast, BERTScore remains competitive on PAWS, which suggests that BERTScore is better at distinguishing harder adversarial examples.<br />
<br />
<div align="center"> [[File: bertscore.png | 500px]] </div><br />
<br />
== Source Code == <br />
The code for this paper is available at [https://github.com/Tiiiger/bert_score BERTScore].<br />
<br />
== Critique & Future Prospects==<br />
A text evaluation metric, BERTScore, is proposed; it outperforms the previous approaches because of its capacity to use contextual embeddings for evaluation. It is simpler, easier to use, and more robust than previous approaches. The robustness is shown by the experiments carried out on datasets consisting of paraphrased sentences. There are variants of BERTScore depending on the contextual embedding model, the use of importance weighting, and the evaluation metric (Precision, Recall, or F1 score). <br />
<br />
The main reason behind the success of BERTScore is the use of contextual embeddings. The remaining architecture is straightforward. Some word-embedding-based metrics use more complex measures for calculating similarity. If those measures were used with contextual embeddings instead of word embeddings, they might yield more reliable performance than BERTScore.<br />
<br />
BERT can also be used for other Natural Language Processing tasks such as text classification and named entity recognition (NER). In the NER task, the IOB tagging scheme is applied to the prediction model. The model and tagging system can be found in the spaCy package, and performance metrics available through Keras are sufficient to evaluate the model.<br />
<br />
The paper is interesting, but the proposed approach has limited technical novelty. The method is a natural application of BERT combined with traditional cosine similarity measures, precision-, recall-, and F1-based computations, and simple idf-based importance weighting.<br />
<br />
== References ==<br />
<br />
[1] Evan James Williams. Regression analysis. Wiley, 1959.<br />
<br />
[2] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In EMNLP, 2014.<br />
<br />
[3] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. Learning to evaluate image captioning. In CVPR, 2018.<br />
<br />
[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.<br />
<br />
[5] Qingsong Ma, Ondrej Bojar, and Yvette Graham. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In WMT, 2018.<br />
<br />
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.<br />
<br />
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019.<br />
<br />
[8] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, 2016.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT&diff=49849BERTScore: Evaluating Text Generation with BERT2020-12-10T03:08:08Z<p>A2chanan: /* Machine Translation */</p>
<hr />
<div>== Presented by == <br />
Gursimran Singh<br />
<br />
== Introduction == <br />
In recent times, various machine learning approaches for text generation have gained popularity. This paper aims to develop an automatic metric that judges the quality of generated text. Commonly used metrics rely on either n-gram matching or word embeddings to calculate the similarity between the reference and the candidate sentence. BERTScore, on the other hand, calculates the similarity using contextual embeddings. BERTScore addresses two common pitfalls of n-gram-based metrics. Firstly, n-gram models fail to robustly match paraphrases, which leads to performance underestimation when semantically correct phrases are penalized because they differ from the surface form of the reference. In BERTScore, the similarity is instead computed using contextualized token embeddings, which have been shown to be effective for paraphrase detection. Secondly, n-gram models fail to capture distant dependencies and to penalize semantically critical ordering changes. In contrast, contextualized embeddings capture distant dependencies and ordering effectively. Finally, BERTScore is a task-independent evaluation metric, which makes it more widely applicable than task-specific metrics. The authors carried out experiments in Machine Translation and Image Captioning to show that BERTScore is more reliable and robust than the previous approaches.<br />
<br />
''' Word versus Context Embeddings '''<br />
<br />
Both word embeddings and contextual embeddings aim to reduce the sparseness of a bag-of-words (BoW) representation of text, which arises from high-dimensional vocabularies. Both create embeddings of a dimensionality much lower than sparse BoW and aim to capture semantics and context. Word embeddings are static: a given word embedding model always produces the same embedding for a word, regardless of the surrounding words. Contextual embeddings, in contrast, produce different embeddings for a word depending on the surrounding words in the given text.<br />
<br />
== Previous Work ==<br />
Previous approaches for evaluating text generation can be broadly divided into several categories. The most commonly used techniques are based on n-gram matching. The main objective here is to compare the n-grams in the reference and candidate sentences and thus analyze the ordering of words in the sentences. Most of these methods use, or slightly modify, the exact match precision (Exact-<math>P_n</math>) and recall (Exact-<math>R_n</math>) scores. These scores can be formalized as follows:<br />
<br />
<div align="center">Exact- <math> P_n = \frac{\sum_{w \in S^{n}_{\hat{x}}} \mathbb{I}[w \in S^{n}_{x}]}{|S^{n}_{\hat{x}}|} </math> </div><br />
<br />
<div align="center">Exact- <math> R_n = \frac{\sum_{w \in S^{n}_{x}} \mathbb{I}[w \in S^{n}_{\hat{x}}]}{|S^{n}_{x}|} </math> </div><br />
<br />
Here <math>\mathbb{I}[\cdot]</math> is an indicator function, and <math>S^{n}_{x}</math> and <math>S^{n}_{\hat{x}}</math> are the lists of token <math>n</math>-grams in the reference sentence <math>x</math> and the candidate sentence <math>\hat{x}</math>, respectively.<br />
<br />
The most popular n-gram matching metric is BLEU (Bilingual Evaluation Understudy). Its output lies between 0.0 and 1.0, where a score of 0.0 denotes a complete mismatch and a score of 1.0 denotes a perfect match between the candidate sentence and the reference sentence. It follows the underlying principle of n-gram matching and makes the following three modifications to Exact-<math>P_n</math>: <br><br />
• Each n-gram is matched at most once. <br><br />
• The total number of exact matches is accumulated over all reference-candidate pairs and divided by the total number of <math>n</math>-grams in all candidate sentences. <br><br />
• Very short candidates are penalized through a brevity penalty. <br><br />
<br />
Further, BLEU is generally calculated for multiple values of <math>n</math> and averaged geometrically.<br />
Other n-gram approaches include METEOR, NIST, ΔBLEU, etc.<br />
METEOR (Banerjee & Lavie, 2005) computes Exact- <math> P_1 </math> and Exact- <math> R_1 </math> with the modification that, when exact unigram matching is not possible, matching to word stems, synonyms, and paraphrases is used instead. For example, ''running'' may be matched with ''run'' if no exact match is found. This non-exact matching relies on external resources such as a stemmer, a synonym lexicon, and a paraphrase table. In newer versions of METEOR, an external paraphrase resource is used and different weights are assigned to different matching types. <br />
<br />
Other categories include edit-distance-based metrics, which compare two strings by calculating the minimum number of operations needed to transform one into the other; embedding-based metrics, which are derived by mapping the strings into an embedding space; and learned metrics, which construct task-specific metrics using a machine learning approach on a supervised dataset. Most of these techniques do not capture the context of a word in the sentence. Moreover, learned metrics require costly human judgments as supervision for each dataset.<br />
<br />
== Motivation ==<br />
The <math>n</math>-gram approaches like BLEU do not capture the positioning and the context of the word and simply rely on exact matching for evaluation. Consider the following example that shows how BLEU can result in incorrect judgment. <br><br />
Reference: people like foreign cars <br><br />
Candidate 1: people like visiting places abroad <br><br />
Candidate 2: consumers prefer imported cars<br />
<br />
BLEU gives a higher score to Candidate 1 than to Candidate 2, even though Candidate 2 is the better paraphrase of the reference. This leads to underestimating the performance of text generation models, since semantically correct sentences are penalized while semantically different phrases are scored higher just because they are closer to the surface form of the reference sentence. <br />
<br />
On the other hand, BERTScore computes similarity using contextual token embeddings. This helps in detecting semantically correct paraphrases. It also captures changes in cause-and-effect ordering (A gives B as opposed to B gives A) that BLEU fails to detect.<br />
<br />
== BERTScore Architecture ==<br />
Fig 1 summarizes the steps for calculating the BERTScore. Next, we will see details about each step. Here, the reference sentence is given by <math> x = \langle x_1, \ldots, x_k \rangle </math> and the candidate sentence by <math> \hat{x} = \langle \hat{x}_1, \ldots, \hat{x}_l \rangle. </math> <br><br />
<br />
<div align="center"> [[File:Architecture_BERTScore.PNG|Illustration of the computation of BERTScore.]] </div><br />
<div align="center">'''Fig 1'''</div><br />
<br />
=== Token Representation ===<br />
Reference and candidate sentences are represented using contextual embeddings. This is inspired by word embedding techniques, but in contrast to word embeddings, the contextual embedding of a word depends on the surrounding words in the sentence. These contextual embeddings are calculated using models such as BERT, RoBERTa, XLNet, and XLM, which utilize self-attention and nonlinear transformations.<br />
<br />
<div align="center"> [[File:Pearsson_corr_contextual_emb.PNG|Pearson Correlation for Contextual Embedding]] </div><br />
<div align="center">'''Fig 2'''</div><br />
<br />
=== Cosine Similarity ===<br />
Pairwise cosine similarity is calculated between each token <math> x_{i} </math> in the reference sentence and each token <math> \hat{x}_{j} </math> in the candidate sentence. Pre-normalized vectors are used, so the pairwise similarity reduces to the inner product <math> x_{i}^\top \hat{x}_{j}. </math><br />
<br />
=== BERTScore ===<br />
<br />
Each token in <math> x </math> is matched to the most similar token in <math> \hat{x} </math> to calculate Recall, and each token in <math> \hat{x} </math> is matched to the most similar token in <math> x </math> to calculate Precision. The matching is greedy: every token is matched in isolation, independently of how the other tokens are matched. Precision and Recall are then combined into the F1 score. The equations for calculating Precision, Recall, and F1 score are as follows:<br />
<br />
<div align="center"> [[File:Equations.PNG|Equations for the calculation of BERTScore.]] </div><br />
<br />
<br />
=== Importance Weighting (optional) ===<br />
In some cases, rare words can be highly indicative of sentence similarity. Therefore, inverse document frequency (idf) weights can be incorporated into the above BERTScore equations. This is optional: depending on the domain of the text and the available data, it may or may not benefit the final results. Understanding importance weighting better therefore remains an open area of research.<br />
<br />
=== Baseline Rescaling ===<br />
Rescaling is done only to increase the human readability of the score. In theory, cosine similarity values lie between -1 and 1, but in practice they are confined to a much smaller range. A baseline value <math> b </math>, computed using Common Crawl monolingual datasets, is used to linearly rescale the BERTScore. The rescaled recall <math> \hat{R}_{BERT} </math> is given by<br />
<div align="center"> [[File:Equation2.PNG|Equation for the rescaled BERTScore.]] </div><br />
Similarly, <math> P_{BERT} </math> and <math> F_{BERT} </math> are rescaled as well.<br />
<br />
=== Experiment & Results ===<br />
The authors have experimented with different pre-trained contextual embedding models like BERT, RoBERTa, etc, and reported the best performing model results. In addition to the standard evaluation, they have also designed model selection experiments. They used 10K hybrid systems super-sampled from WMT18. They randomly select 100 out of 10K hybrid systems and rank them using the automatic metrics. The evaluation has been done on Machine Translation and Image Captioning tasks.<br />
<br />
=== Machine Translation ===<br />
The metric evaluation dataset consists of 149 translation systems, gold references, and two types of human judgments, namely segment-level and system-level human judgments. The former assigns a score to each reference-candidate pair and the latter associates a single score with the whole system. Segment-level outputs for BERTScore are calculated as explained in the previous section on architecture, and system-level outputs are calculated by averaging the BERTScore over every reference-candidate pair. Absolute Pearson correlation <math> \lvert \rho \rvert </math> and Kendall rank correlation <math> \tau </math> are used to measure metric quality, with the Williams test <sup> [1] </sup> for the significance of <math> \lvert \rho \rvert </math> and the bootstrap re-sampling method of Graham & Baldwin <sup> [2] </sup> for the significance of <math> \tau </math>. The authors have also created hybrid systems by randomly sampling, for each reference sentence, one candidate sentence from one of the systems. This increases the number of systems available for system-level experiments. Further, the authors randomly selected 100 systems out of the 10K hybrid systems and ranked them using the automatic metrics. They repeated this process multiple times and reported Hits@1, the percentage of trials in which the metric ranking agrees with the human ranking on the best system. <br />
<br />
<div align="center"> '''The following 4 tables show the result of the experiments mentioned above.''' </div> <br><br><br />
<br />
<div align="center"> [[File:Table1_BERTScore.PNG|700px| Table1 Machine Translation]] [[File:Table2_BERTScore.PNG|700px| Table2 Machine Translation]] </div><br />
<div align="center"> [[File:Table3_BERTScore.PNG|700px| Table3 Machine Translation]] [[File:Table4_BERTScore.PNG|700px| Table4 Machine Translation]] </div><br />
<br />
In all 4 tables, we can see that BERTScore is consistently a top performer. It also gives a large improvement over the widely used BLEU score. In to-English translation, RUSE shows competitive results, but it is a learned metric and requires costly human judgments as supervision.<br />
<br />
=== Image Captioning ===<br />
For Image Captioning, human judgment for twelve submission entries from the COCO 2015 Captioning Challenge is used. As per Cui et al. (2018) <sup> [3] </sup>, Pearson Correlation with two System-Level metrics is calculated. The metrics used in the results are the percentage of captions better than or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). There are approximately five reference captions and the BERTScore is taken to be the maximum of all the BERTScores individually with each reference caption. BERTScore is compared with eight task-agnostic metrics (shown under the Metric column in Table 5) and two task-specific metrics, Semantic Propositional Image Caption Evaluation (SPICE) [8] and Learning to Evaluate Image Caption (LEIC) [3]. Given an input image, LEIC predicts whether a caption is written by a human whereas SPICE makes use of scene graphs parsed from reference and candidate captions to compare the similarity.<br />
<br />
<div align="center"> [[File:Table5_BERTScore.PNG|450px| Table5 Image Captioning]] </div><br />
<br />
<div align="center"> '''Table 5: Pearson correlation on the 2015 COCO Captioning Challenge.''' </div><br />
<br />
BERTScore is again a top performer, while n-gram metrics like BLEU show a weak correlation with human judgments. For this task, importance weighting yields a significant improvement, which indicates the importance of content words. <br />
<br />
'''Speed:''' BERTScore is slower to compute than BLEU, but the difference is small in absolute terms. For example, on the same hardware, the Machine Translation test takes 15.6 seconds with BERTScore compared to 5.4 seconds with BLEU. Both run times are short, so the extra cost is marginal in practice.<br />
<br />
== Robustness Analysis ==<br />
The authors tested BERTScore's robustness using two adversarial paraphrase classification datasets, QQP and PAWS. The table below summarizes the results. Most metrics perform well on QQP, but their performance drops significantly on PAWS. In contrast, BERTScore performs competitively on PAWS, which suggests that BERTScore is better at distinguishing harder adversarial examples.<br />
<br />
<div align="center"> [[File: bertscore.png | 500px]] </div><br />
<br />
== Source Code == <br />
The code for this paper is available at [https://github.com/Tiiiger/bert_score BERTScore].<br />
<br />
== Critique & Future Prospects==<br />
A text evaluation metric, BERTScore, is proposed and outperforms the previous approaches because of its capacity to use contextual embeddings for evaluation. It is simpler, easier to use, and more robust than previous approaches. This is shown by the experiments carried out on datasets consisting of paraphrased sentences. There are variants of BERTScore depending upon the contextual embedding model, the use of importance weighting, and the evaluation metric (Precision, Recall, or F1 score). <br />
<br />
The main reason behind the success of BERTScore is the use of contextual embeddings. The remaining architecture is straightforward in itself. Some word embedding approaches use more complex metrics for calculating similarity. If those metrics were applied to contextual embeddings instead of word embeddings, they might yield more reliable performance than BERTScore.<br />
<br />
BERT can also be used for other Natural Language Processing tasks such as text classification and named entity recognition (NER). In the NER task, the IOB tagging scheme was applied to the prediction model. The model and tagging system can be found in the spaCy package, and performance metrics available through Keras are sufficient to evaluate the model.<br />
<br />
The paper is interesting, but the proposed approach has limited technical novelty. The method is a natural application of BERT combined with traditional cosine similarity, precision, recall, and F1 computations, and simple IDF-based importance weighting.<br />
<br />
== References ==<br />
<br />
[1] Evan James Williams. Regression analysis. Wiley, 1959.<br />
<br />
[2] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In EMNLP, 2014.<br />
<br />
[3] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. Learning to evaluate image captioning. In CVPR, 2018.<br />
<br />
[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.<br />
<br />
[5] Qingsong Ma, Ondrej Bojar, and Yvette Graham. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In WMT, 2018.<br />
<br />
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.<br />
<br />
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019.<br />
<br />
[8] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, 2016.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT&diff=49848BERTScore: Evaluating Text Generation with BERT2020-12-10T02:55:42Z<p>A2chanan: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Gursimran Singh<br />
<br />
== Introduction == <br />
In recent times, various machine learning approaches for text generation have gained popularity. This paper aims to develop an automatic metric that judges the quality of generated text. Commonly used metrics rely on either n-gram matching or word embeddings to calculate the similarity between the reference and the candidate sentence. BERTScore, on the other hand, calculates the similarity using contextual embeddings, and it addresses two common pitfalls of n-gram-based metrics. First, n-gram metrics fail to robustly match paraphrases, which leads to performance underestimation when semantically correct phrases are penalized because they differ from the surface form of the reference. BERTScore instead computes similarity using contextualized token embeddings, which have been shown to be effective for paraphrase detection. Second, n-gram metrics fail to capture distant dependencies and to penalize semantically critical ordering changes. In contrast, contextualized embeddings capture distant dependencies and ordering effectively. Finally, BERTScore is a task-independent evaluation metric, which makes it a convenient choice compared with task-specific metrics. The authors carried out experiments in Machine Translation and Image Captioning to show that BERTScore is more reliable and robust than previous approaches.<br />
<br />
''' Word versus Context Embeddings '''<br />
<br />
Both word embeddings and contextual embeddings aim to reduce the sparseness of a bag-of-words (BoW) representation of text, which arises from high-dimensional vocabularies. Both methods create embeddings of much lower dimensionality than sparse BoW vectors and aim to capture semantics and context. Word embeddings are deterministic in the sense that a given word embedding model always produces the same embedding for a word, regardless of the surrounding words. Contextual embeddings, in contrast, produce different embeddings for a word depending on the surrounding words in the given text.<br />
<br />
== Previous Work ==<br />
Previous approaches for evaluating text generation can be broadly divided into several categories. The most commonly used techniques are based on n-gram matching. The main objective here is to compare the n-grams in the reference and candidate sentences and thus analyze the ordering of words in the sentences. Most of the methods use or slightly modify the exact match precision (Exact-<math>P_n</math>) and recall (Exact-<math>R_n</math>) scores. These scores can be formalized as follows:<br />
<br />
<div align="center">Exact- <math> P_n = \frac{\sum_{w \in S^{n}_{\hat{x}}} \mathbb{I}[w \in S^{n}_{x}]}{|S^{n}_{\hat{x}}|} </math> </div><br />
<br />
<div align="center">Exact- <math> R_n = \frac{\sum_{w \in S^{n}_{x}} \mathbb{I}[w \in S^{n}_{\hat{x}}]}{|S^{n}_{x}|} </math> </div><br />
<br />
Here <math>\mathbb{I}[\cdot]</math> is an indicator function, and <math>S^{n}_{x}</math> and <math>S^{n}_{\hat{x}}</math> are the lists of token <math>n</math>-grams in the reference <math>x</math> and candidate <math>\hat{x}</math> sentences, respectively.<br />
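<br />
The two scores can be sketched in a few lines of Python. This is only an illustration: it assumes whitespace tokenization, and the function names <code>ngrams</code> and <code>exact_precision_recall</code> are ours, not taken from the paper.<br />

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) contained in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def exact_precision_recall(reference, candidate, n=1):
    """Exact-P_n and Exact-R_n for two whitespace-tokenized sentences."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    ref_set, cand_set = set(ref), set(cand)
    # Precision: fraction of candidate n-grams that also occur in the reference.
    precision = sum(w in ref_set for w in cand) / len(cand) if cand else 0.0
    # Recall: fraction of reference n-grams that also occur in the candidate.
    recall = sum(w in cand_set for w in ref) / len(ref) if ref else 0.0
    return precision, recall


# Two of the three unigrams match in both directions.
p1, r1 = exact_precision_recall("the cat sat", "the cat ran", n=1)
# Only the bigram ("the", "cat") matches.
p2, r2 = exact_precision_recall("the cat sat", "the cat ran", n=2)
```

Note that the indicator is evaluated once per n-gram occurrence, so repeated n-grams are counted repeatedly; this is the behaviour that BLEU's clipping modifies.<br />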
<br />
The most popular n-gram matching metric is BLEU (Bilingual Evaluation Understudy). The output of this metric lies between 0.0 and 1.0, where a score of 0.0 denotes a complete mismatch and a score of 1.0 denotes a perfect match between the candidate and reference sentences. It follows the underlying principle of n-gram matching and makes the following three modifications to Exact-<math>P_n</math>: <br><br />
• Each n-gram in the reference is matched at most once. <br><br />
• The number of exact matches is accumulated over all reference-candidate pairs and divided by the total number of <math>n</math>-grams in all candidate sentences. <br><br />
• Very short candidates are discouraged using a brevity penalty. <br><br />
<br />
Furthermore, BLEU is generally calculated for multiple values of <math>n</math> and the results are averaged geometrically.<br />
Other n-gram approaches include METEOR, NIST, ΔBLEU, etc.<br />
METEOR (Banerjee & Lavie, 2005) computes Exact- <math> P_1 </math> and Exact- <math> R_1 </math> with the modification that, when exact unigram matching is not possible, matching to word stems, synonyms, and paraphrases is used instead. For example, ''running'' may be matched with ''run'' if no exact match is found. This non-exact matching relies on external tools such as a paraphrase table. In newer versions of METEOR, an external paraphrase resource is used and different weights are assigned to different matching types. <br />
<br />
Other categories include edit-distance-based metrics, which compare two strings by calculating the minimum number of operations needed to transform one into the other; embedding-based metrics, which compare the strings in a learned embedding space; and learned metrics, which construct task-specific metrics by applying machine learning to a supervised dataset. Most of these techniques do not capture the context of a word in the sentence. Moreover, learned metrics require costly human judgments as supervision for each dataset.<br />
<br />
== Motivation ==<br />
The <math>n</math>-gram approaches like BLEU do not capture the positioning and the context of the word and simply rely on exact matching for evaluation. Consider the following example that shows how BLEU can result in incorrect judgment. <br><br />
Reference: people like foreign cars <br><br />
Candidate 1: people like visiting places abroad <br><br />
Candidate 2: consumers prefer imported cars<br />
<br />
BLEU gives a higher score to Candidate 1 as compared to Candidate 2. This undermines the performance of text generation models since contextually correct sentences are penalized. In contrast, some semantically different phrases are scored higher just because they are closer to the surface form of the reference sentence. <br />
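<br />
The ranking described above can be reproduced with a clipped unigram precision, the basic building block of BLEU. This is a simplified sketch under our own assumptions (whitespace tokenization, unigrams only, no geometric averaging and no brevity penalty), so the numbers are not actual BLEU scores.<br />

```python
from collections import Counter


def clipped_unigram_precision(reference, candidate):
    """Fraction of candidate tokens matched in the reference, each reference token used at most once."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Clip each candidate token count by its count in the reference.
    matched = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0


reference = "people like foreign cars"
candidate_1 = "people like visiting places abroad"
candidate_2 = "consumers prefer imported cars"

p1 = clipped_unigram_precision(reference, candidate_1)  # 2 of 5 tokens match
p2 = clipped_unigram_precision(reference, candidate_2)  # 1 of 4 tokens match
```

Candidate 1 scores higher than Candidate 2 even though Candidate 2 is the faithful paraphrase, which is exactly the failure mode that motivates BERTScore.<br />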
<br />
On the other hand, BERTScore computes similarity using contextual token embeddings, which helps in detecting semantically correct paraphrases. It also captures cause-and-effect relationships (for example, ''A gives B'' versus ''B gives A'') that BLEU fails to detect.<br />
<br />
== BERTScore Architecture ==<br />
Fig 1 summarizes the steps for calculating BERTScore; each step is described in detail below. Here, the reference sentence is given by <math> x = \langle x_1, \ldots, x_k \rangle </math> and the candidate sentence by <math> \hat{x} = \langle \hat{x}_1, \ldots, \hat{x}_l \rangle. </math> <br><br />
<br />
<div align="center"> [[File:Architecture_BERTScore.PNG|Illustration of the computation of BERTScore.]] </div><br />
<div align="center">'''Fig 1'''</div><br />
<br />
=== Token Representation ===<br />
Reference and candidate sentences are represented using contextual embeddings. The idea is inspired by word embedding techniques, but in contrast to word embeddings, the contextual embedding of a word depends on the surrounding words in the sentence. These contextual embeddings are calculated using models such as BERT, RoBERTa, XLNet, and XLM, which use self-attention and nonlinear transformations.<br />
<br />
<div align="center"> [[File:Pearsson_corr_contextual_emb.PNG|Pearson Correlation for Contextual Embedding]] </div><br />
<div align="center">'''Fig 2'''</div><br />
<br />
=== Cosine Similarity ===<br />
Pairwise cosine similarity is calculated between each token <math> x_{i} </math> in the reference sentence and each token <math> \hat{x}_{j} </math> in the candidate sentence. Because the vectors are pre-normalized, the pairwise similarity reduces to the inner product <math> x_{i}^T \hat{x}_{j}. </math><br />
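<br />
As a minimal sketch of this step, the following code builds the pairwise similarity matrix from token vectors. The vectors here are small hand-written lists standing in for contextual embeddings (an assumption for illustration; a real implementation would obtain them from a model such as BERT), and the function names are ours.<br />

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in vector))
    return [a / norm for a in vector]


def similarity_matrix(ref_vectors, cand_vectors):
    """sim[i][j] is the cosine similarity of reference token i and candidate token j."""
    ref = [normalize(v) for v in ref_vectors]
    cand = [normalize(v) for v in cand_vectors]
    # After normalization the cosine similarity is just the inner product.
    return [[sum(a * b for a, b in zip(x, y)) for y in cand] for x in ref]


# Two reference tokens and two candidate tokens in a toy 2-dimensional space.
sim = similarity_matrix([[1.0, 0.0], [0.0, 2.0]], [[3.0, 0.0], [1.0, 1.0]])
```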
<br />
=== BERTScore ===<br />
<br />
Each token in <math> x </math> is matched to the most similar token in <math> \hat{x} </math> to compute Recall, and each token in <math> \hat{x} </math> is matched to the most similar token in <math> x </math> to compute Precision. The matching is greedy and isolated. Precision and Recall are then combined into the F1 score. The equations for Precision, Recall, and F1 score are as follows:<br />
<br />
<div align="center"> [[File:Equations.PNG|Equations for the calculation of BERTScore.]] </div><br />
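<br />
The greedy matching can be sketched as follows, given a precomputed similarity matrix whose rows correspond to reference tokens and whose columns correspond to candidate tokens. The function name and the toy matrix are our own illustration, not the authors' implementation.<br />

```python
def bertscore_from_similarity(sim):
    """Return (precision, recall, f1) from sim[i][j] = similarity(reference token i, candidate token j)."""
    n_ref, n_cand = len(sim), len(sim[0])
    # Recall: every reference token takes its best-matching candidate token.
    recall = sum(max(row) for row in sim) / n_ref
    # Precision: every candidate token takes its best-matching reference token.
    precision = sum(max(sim[i][j] for i in range(n_ref)) for j in range(n_cand)) / n_cand
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example with 2 reference tokens (rows) and 3 candidate tokens (columns).
toy_sim = [[0.9, 0.2, 0.1],
           [0.3, 0.8, 0.4]]
precision, recall, f1 = bertscore_from_similarity(toy_sim)
```

Because each token is matched independently, two tokens may pick the same partner; no one-to-one alignment is enforced.<br />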
<br />
<br />
=== Importance Weighting (optional) ===<br />
In some cases, rare words can be highly indicative of sentence similarity. Therefore, Inverse Document Frequency (idf) weights can be incorporated into the above BERTScore equations. This step is optional: depending on the domain of the text and the available data, it may or may not benefit the final results. Understanding importance weighting better remains an open area of research.<br />
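<br />
A sketch of idf-weighted recall is given below. It assumes that idf values are estimated from a corpus of tokenized reference sentences and uses one common plus-one smoothing variant, <math>\log\frac{M+1}{\mathrm{df}(w)+1}</math>; the exact smoothing, the function names, and the toy data are our assumptions.<br />

```python
import math


def make_idf(reference_corpus):
    """Build an idf function from a list of tokenized reference sentences."""
    num_sentences = len(reference_corpus)
    doc_freq = {}
    for sentence in reference_corpus:
        for token in set(sentence):
            doc_freq[token] = doc_freq.get(token, 0) + 1

    def idf(token):
        # Plus-one smoothing keeps the value finite for unseen tokens.
        return math.log((num_sentences + 1) / (doc_freq.get(token, 0) + 1))

    return idf


def weighted_recall(ref_tokens, sim, idf):
    """idf-weighted average of each reference token's best match in the candidate."""
    weights = [idf(token) for token in ref_tokens]
    total = sum(weights)
    if total == 0:
        # All weights vanish: fall back to the unweighted average.
        return sum(max(row) for row in sim) / len(sim)
    return sum(w * max(row) for w, row in zip(weights, sim)) / total


corpus = [["the", "cat"], ["the", "dog"], ["the", "bird"]]
idf = make_idf(corpus)
# "the" occurs in every sentence, so its weight is 0 and only "cat" counts.
score = weighted_recall(["the", "cat"], [[0.2, 0.1], [0.9, 0.3]], idf)
```

In this toy example the unweighted recall would be 0.55, while the weighted recall is 0.9, because the poorly matched function word ''the'' no longer contributes.<br />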
<br />
=== Baseline Rescaling ===<br />
Rescaling is done only to improve the human readability of the score. In theory, cosine similarity values lie between -1 and 1, but in practice they are confined to a much smaller range. A baseline value <math> b </math>, computed using Common Crawl monolingual datasets, is used to linearly rescale BERTScore. The rescaled recall <math> \hat{R}_{BERT} </math> is given by<br />
<div align="center"> [[File:Equation2.PNG|Equation for the rescaled BERTScore.]] </div><br />
Similarly, <math> P_{BERT} </math> and <math> F_{BERT} </math> are rescaled as well.<br />
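<br />
As we read the equation above, the rescaling maps the baseline <math> b </math> to 0 and leaves 1 unchanged, i.e. <math> \hat{R}_{BERT} = \frac{R_{BERT} - b}{1 - b} </math>. A short sketch follows; the baseline value 0.84 is a made-up number for illustration, not one reported in the paper.<br />

```python
def rescale(score, baseline):
    """Linearly rescale a BERTScore value so that the baseline maps to 0 and 1 stays at 1."""
    return (score - baseline) / (1.0 - baseline)


hypothetical_baseline = 0.84  # illustrative value only
rescaled = rescale(0.92, hypothetical_baseline)  # halfway between the baseline and 1
```

The rescaling is monotone, so it does not change the ranking of candidates or the correlation with human judgments.<br />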
<br />
== Experiment & Results ==<br />
The authors experimented with different pre-trained contextual embedding models such as BERT and RoBERTa, and reported the results of the best-performing model. In addition to the standard evaluation, they also designed model selection experiments: they use 10K hybrid systems super-sampled from WMT18, randomly select 100 of them, and rank these using the automatic metrics. The evaluation was done on Machine Translation and Image Captioning tasks.<br />
<br />
=== Machine Translation ===<br />
The metric evaluation dataset consists of 149 translation systems, gold references, and two types of human judgments: segment-level and system-level. The former assigns a score to each reference-candidate pair, and the latter assigns a single score to the whole system. Segment-level outputs for BERTScore are calculated as explained in the previous section on architecture, and system-level outputs are calculated by averaging BERTScore over every reference-candidate pair. Absolute Pearson correlation <math> \lvert \rho \rvert </math> and Kendall rank correlation <math> \tau </math> are used to measure metric quality. Significance is tested with the Williams test <sup> [1] </sup> for <math> \lvert \rho \rvert </math> and with bootstrap resampling, following Graham & Baldwin <sup> [2] </sup>, for <math> \tau </math>. The authors also created hybrid systems by randomly sampling, for each reference sentence, one candidate sentence from one of the systems, which increases the number of systems available for system-level experiments. Further, they randomly selected 100 systems out of the 10K hybrid systems and ranked them using the automatic metrics. Repeating this process multiple times yields Hits@1, the percentage of trials in which the metric ranking agrees with the human ranking on the best system. <br />
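<br />
The three quantities used to judge a metric can be sketched in plain Python. This is a simplified illustration under our own assumptions: the Kendall statistic below is the basic tau-a without tie handling (the WMT evaluation uses its own variant), and all scores in the example are invented.<br />

```python
import math


def pearson(xs, ys):
    """Pearson correlation between metric scores and human scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def hits_at_1(trials):
    """Fraction of trials in which metric and humans agree on the best system."""
    hits = 0
    for metric_scores, human_scores in trials:
        best_by_metric = max(range(len(metric_scores)), key=metric_scores.__getitem__)
        best_by_human = max(range(len(human_scores)), key=human_scores.__getitem__)
        hits += best_by_metric == best_by_human
    return hits / len(trials)


metric = [0.10, 0.40, 0.35, 0.80]  # invented metric scores for four systems
human = [1.0, 2.0, 3.0, 4.0]       # invented human scores for the same systems
rho = abs(pearson(metric, human))
tau = kendall_tau(metric, human)   # one discordant pair out of six
agreement = hits_at_1([(metric, human), ([0.8, 0.2], [1.0, 2.0])])
```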
<br />
<div align="center"> '''The following 4 tables show the results of the experiments mentioned above.''' </div> <br><br />
<br />
<div align="center"> [[File:Table1_BERTScore.PNG|700px| Table1 Machine Translation]] [[File:Table2_BERTScore.PNG|700px| Table2 Machine Translation]] </div><br />
<div align="center"> [[File:Table3_BERTScore.PNG|700px| Table3 Machine Translation]] [[File:Table4_BERTScore.PNG|700px| Table4 Machine Translation]] </div><br />
<br />
In all four tables, BERTScore is consistently a top performer. It also gives a large improvement over BLEU, the current de facto standard metric. In to-English translation, RUSE shows competitive results, but it is a learned metric and requires costly human judgments as supervision.<br />
<br />
=== Image Captioning ===<br />
For Image Captioning, human judgments for twelve submission entries from the COCO 2015 Captioning Challenge are used. Following Cui et al. (2018) <sup> [3] </sup>, the Pearson correlation with two system-level metrics is calculated: the percentage of captions that are better than or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). Each image has approximately five reference captions, so BERTScore is computed against each reference caption individually and the maximum is taken as the final score. BERTScore is compared with eight task-agnostic metrics (shown under the Metric column in Table 5) and two task-specific metrics, Semantic Propositional Image Caption Evaluation (SPICE) [8] and Learning to Evaluate Image Captioning (LEIC) [3]. Given an input image, LEIC predicts whether a caption is written by a human, whereas SPICE uses scene graphs parsed from the reference and candidate captions to compare their similarity.<br />
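<br />
Taking the maximum over several references is independent of the underlying metric, so it can be sketched with any pairwise scoring function. In the code below, <code>toy_score</code> is a deliberately simple stand-in (Jaccard overlap of token sets) that we use only so the example runs without a language model; it is not BERTScore.<br />

```python
def multi_reference_score(candidate, references, score_fn):
    """Score the candidate against every reference and keep the best score."""
    return max(score_fn(reference, candidate) for reference in references)


def toy_score(reference, candidate):
    """Stand-in pairwise metric: Jaccard overlap of the two token sets."""
    ref_tokens, cand_tokens = set(reference.split()), set(candidate.split())
    return len(ref_tokens & cand_tokens) / len(ref_tokens | cand_tokens)


references = ["a dog runs", "a cat sits on a mat"]
best = multi_reference_score("a cat sits", references, toy_score)
```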
<br />
<div align="center"> [[File:Table5_BERTScore.PNG|450px| Table5 Image Captioning]] </div><br />
<br />
<div align="center"> '''Table 5: Pearson correlation on the 2015 COCO Captioning Challenge.''' </div><br />
<br />
BERTScore is again a top performer, and n-gram metrics like BLEU show a weak correlation with human judgments. For this task, importance weighting yields a significant improvement, which underlines the importance of content words. <br />
<br />
'''Speed:''' Calculating BERTScore takes longer than calculating BLEU, but the overhead is small in absolute terms. For example, on the same hardware, the Machine Translation test takes 15.6 seconds with BERTScore compared to 5.4 seconds with BLEU. Since both runs finish within seconds, the difference is marginal in practice.<br />
<br />
== Robustness Analysis ==<br />
The authors tested BERTScore's robustness using two adversarial paraphrase classification datasets, QQP and PAWS. The table below summarizes the results. Most metrics perform well on QQP, but their performance drops significantly on PAWS. In contrast, BERTScore performs competitively on PAWS, which suggests that BERTScore is better at distinguishing harder adversarial examples.<br />
<br />
<div align="center"> [[File: bertscore.png | 500px]] </div><br />
<br />
== Source Code == <br />
The code for this paper is available at [https://github.com/Tiiiger/bert_score BERTScore].<br />
<br />
== Critique & Future Prospects==<br />
A text evaluation metric, BERTScore, is proposed and outperforms the previous approaches because of its capacity to use contextual embeddings for evaluation. It is simpler, easier to use, and more robust than previous approaches. This is shown by the experiments carried out on datasets consisting of paraphrased sentences. There are variants of BERTScore depending upon the contextual embedding model, the use of importance weighting, and the evaluation metric (Precision, Recall, or F1 score). <br />
<br />
The main reason behind the success of BERTScore is the use of contextual embeddings. The remaining architecture is straightforward in itself. Some word embedding approaches use more complex metrics for calculating similarity. If those metrics were applied to contextual embeddings instead of word embeddings, they might yield more reliable performance than BERTScore.<br />
<br />
BERT can also be used for other Natural Language Processing tasks such as text classification and named entity recognition (NER). In the NER task, the IOB tagging scheme was applied to the prediction model. The model and tagging system can be found in the spaCy package, and performance metrics available through Keras are sufficient to evaluate the model.<br />
<br />
The paper is interesting, but the proposed approach has limited technical novelty. The method is a natural application of BERT combined with traditional cosine similarity, precision, recall, and F1 computations, and simple IDF-based importance weighting.<br />
<br />
== References ==<br />
<br />
[1] Evan James Williams. Regression analysis. Wiley, 1959.<br />
<br />
[2] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In EMNLP, 2014.<br />
<br />
[3] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. Learning to evaluate image captioning. In CVPR, 2018.<br />
<br />
[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.<br />
<br />
[5] Qingsong Ma, Ondrej Bojar, and Yvette Graham. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In WMT, 2018.<br />
<br />
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.<br />
<br />
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019.<br />
<br />
[8] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, 2016.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=49847GradientLess Descent2020-12-10T02:06:16Z<p>A2chanan: /* Critiques */</p>
<hr />
<div>== Presented By ==<br />
Jose Avilez<br />
==Introduction==<br />
<br />
In this presentation, we are interested in minimising a smooth convex function without ever computing its derivatives.<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider smooth, convex, noiseless objective functions, where we have access to function evaluations but not to gradient computations. This class of functions is quite common in practice; for instance, such functions make special appearances in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
<math display="inline"> \forall x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
<math display="inline"> \forall x,y \in K </math>.<br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, then this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">\textbf{H}(f)</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds.<br />
<br />
[[File:ConvexSmooth.PNG|frame|Relationship between convexity and smoothness.]]<br />
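<br />
The two inequalities of Definition 1 can be checked numerically. The sketch below does so for a quadratic with diagonal Hessian, for which <math display="inline">\alpha</math> and <math display="inline">\beta</math> are the smallest and largest diagonal entries. The function names and the test points are our own choices for illustration.<br />

```python
def quadratic(h):
    """Return f(x) = 0.5 * sum_i h[i] * x[i]^2 and its gradient for a diagonal Hessian h."""
    def f(x):
        return 0.5 * sum(hi * xi * xi for hi, xi in zip(h, x))

    def grad(x):
        return [hi * xi for hi, xi in zip(h, x)]

    return f, grad


def within_quadratic_bounds(f, grad, x, y, alpha, beta, tol=1e-9):
    """Check the alpha-strong-convexity and beta-smoothness inequalities at the pair (x, y)."""
    linear = f(x) + sum(g * (yi - xi) for g, xi, yi in zip(grad(x), x, y))
    sq_dist = sum((yi - xi) ** 2 for xi, yi in zip(x, y))
    lower = linear + 0.5 * alpha * sq_dist
    upper = linear + 0.5 * beta * sq_dist
    return lower - tol <= f(y) <= upper + tol


f, grad = quadratic([1.0, 2.0, 5.0])  # Hessian eigenvalues lie in [1, 5]
ok = within_quadratic_bounds(f, grad, [0.5, -1.0, 2.0], [-1.5, 0.25, 1.0], alpha=1.0, beta=5.0)
# Claiming alpha = 6 is too strong for this function and the check fails.
too_strong = within_quadratic_bounds(f, grad, [0.0, 0.0, 0.0], [1.0, 0.0, 0.0], alpha=6.0, beta=6.0)
```

A check at finitely many points can only refute the assumptions, never prove them, which is relevant to the critique that these properties are hard to verify for a black-box function.<br />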
<br />
In convex analysis, one usually says that a function has condition number <math display="inline">Q</math> if it is both <math display="inline">\alpha</math>-strongly convex, and <math display="inline">\beta</math>-smooth, and <math display="inline">\frac{\beta}{\alpha} \leq Q</math>.<br />
The authors of this paper consider the more general case where <math display="inline">f</math> is a monotone transformation of an <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth function; for simplicity and transparency, we shall not consider these extensions here, but shall note that their proofs are quite elementary.<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
In zeroth-order optimisation, we are interested in minimising a function without computing its derivatives. This is important in many practical applications in which derivatives may not be available, or they may be difficult to compute, such as:<br />
<br />
* Combinatorial (i.e. discrete) optimisation<br />
* Instances of non-analytic loss functions (e.g. hyperparameter tuning)<br />
* Adversarial attacks<br />
* Reinforcement learning<br />
<br />
Curiously, a large share of the work in this area focuses on approximating gradients and then using first-order optimisation algorithms.<br />
<br />
This paper presents a purely gradientless algorithm, proposes a geometric approach to analyse the algorithm, and proves an <math display="inline">O( k Q \log (n / \epsilon ))</math> convergence bound. Here <math display="inline"> k </math> is the latent dimension, with <math display="inline"> k < n </math>, and <math display="inline"> n </math> is the input dimension.<br />
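<br />
For contrast with the gradientless approach, here is a sketch of the simplest gradient approximation: a forward finite difference, which needs <math display="inline">n + 1</math> function evaluations per gradient estimate. This is a generic textbook construction shown for illustration, not a method from the paper, and the function names are ours.<br />

```python
def finite_difference_gradient(f, x, h=1e-6):
    """Estimate the gradient of f at x using forward differences (n + 1 evaluations)."""
    fx = f(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h  # perturb one coordinate at a time
        grad.append((f(shifted) - fx) / h)
    return grad


def sphere(x):
    """Simple smooth test function with gradient 2x."""
    return sum(xi * xi for xi in x)


estimate = finite_difference_gradient(sphere, [1.0, -2.0])  # true gradient is [2, -4]
```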
<br />
==GradientLess Descent Algorithm==<br />
<br />
The proposed algorithm is given in the picture below.<br />
<br />
[[File:GLD1.PNG|frame|Gradientless Descent with Binary Search.]]<br />
<br />
Observe that at each step, we sample random points from several concentric balls whose radii form a binary search over step sizes, in the hope that a step of suitable size in a random direction reduces the value of the objective function.<br />
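<br />
The following is a minimal sketch of the search loop as we understand it from the description above. It assumes uniform sampling from a ball, radii of the form <math display="inline">2^{-k} R</math> between <math display="inline">r</math> and <math display="inline">R</math>, and a greedy update that keeps the best point seen in each step. The names <code>sample_ball</code> and <code>gld_search</code> and all parameter values are ours; consult the paper for the exact algorithm.<br />

```python
import math
import random


def sample_ball(n, radius, rng):
    """Draw a point uniformly from the n-dimensional ball of the given radius."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(d * d for d in direction))
    # U^(1/n) gives the correct radial distribution for a uniform sample.
    scale = radius * rng.random() ** (1.0 / n) / norm
    return [d * scale for d in direction]


def gld_search(f, x0, r_min, r_max, steps, seed=0):
    """Gradientless descent sketch: try one random step per radius, keep the best point."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)
    n = len(x)
    num_radii = int(math.ceil(math.log2(r_max / r_min))) + 1
    for _ in range(steps):
        best, f_best = x, fx
        for k in range(num_radii):
            radius = r_max * 2.0 ** (-k)
            v = sample_ball(n, radius, rng)
            y = [xi + vi for xi, vi in zip(x, v)]
            fy = f(y)
            if fy < f_best:  # only function values are compared, never gradients
                best, f_best = y, fy
        x, fx = best, f_best
    return x, fx


def sphere(x):
    return sum(xi * xi for xi in x)


x_best, f_best = gld_search(sphere, [1.0] * 5, r_min=0.01, r_max=2.0, steps=50)
```

Since a step is only accepted when it lowers the function value, the objective never increases along the run. The update also depends only on comparisons between function values, which is why the guarantees extend to monotone transformations of the objective.<br />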
<br />
===Proof of correctness===<br />
<br />
The correctness of this algorithm hinges on two observations. The first one is about the volume of the intersection of high-dimensional balls; we call this intersection a hyperspherical cap.<br />
<br />
'''Theorem 1'''<br />
<br />
Let <math display="inline">B_1, B_2 \subseteq \mathbb{R}^n</math> be balls of radii <math display="inline">r_1, r_2</math>. Let <math display="inline">\ell</math> be the distance between the centres. If <math display="inline">r_1 \in \left[ \frac{\ell}{2 \sqrt{n}} , \frac{\ell}{\sqrt{n}} \right]</math> and <math display="inline">r_2 \geq \ell - \frac{\ell}{4n}</math>, then <math display="inline">\lambda (B_1 \cap B_2) \geq c_n \lambda (B_1)</math>, where <math display="inline">c_n \geq \frac{1}{4}</math>.<br />
<br />
<br />
Using this theorem about random searches in high dimensions, we can prove the correctness of our algorithm.<br />
<br />
'''Theorem 2'''<br />
<br />
<math display="inline"> \forall x \in K</math> s.t. <math display="inline">\frac{3}{5Q} ||x - x^*|| \in [C_1, C_2]</math>, we can find integers <math display="inline">0 \leq k_1, k_2 < \log \frac{C_2}{C_1}</math> such that if <math display="inline">r = 2^{k_1}C_1</math> or <math display="inline">r = 2^{-k_2}C_2</math>, then a sample <math display="inline">y</math> from the uniform distribution on <math display="inline">B_x = B\left( x, \frac{r}{\sqrt{n}} \right) </math> satisfies<br />
\begin{align*}<br />
f(y) - f(x^*) \leq (f(x) - f(x^*)) \left( 1- \frac{1}{5nQ} \right)<br />
\end{align*}<br />
with probability at least <math display="inline">\frac{1}{4}</math>.<br />
<br />
<br />
Notice that the second theorem implies that, with probability at least <math display="inline">\frac{1}{4}</math>, <math display="inline">f(y)</math> is closer to the optimum <math display="inline"> f(x^*) </math> than <math display="inline">f(x)</math> is.<br />
<br />
For proof of these theorems, please watch my talk.<br />
<br />
[[File: GLD2.PNG|frame| Gradientless Descent with Fast Binary Search.]]<br />
<br />
In the form of the GradientLess Descent algorithm presented above, the lower and upper limits of the search radius, i.e. <math display="inline">[r, R]</math>, remain unchanged for the entire run of the algorithm. As the proof of correctness shows, this ensures convergence, but this version of the algorithm does not take advantage of the upper bound on the condition number <math display="inline">Q</math> and therefore has an extra factor of <math display="inline">\log \frac{1}{\epsilon}</math> in its overall cost.<br />
<br />
A variation of this algorithm, termed '''Gradientless Descent with Fast Binary Search (GLD-Fast)''', eliminates this additional factor by reducing the range of the binary search: <math display="inline">R</math> is halved after every <math display="inline">H</math> iterations (where <math display="inline">H</math> is determined by <math display="inline">Q</math>).<br />
<br />
<br />
<br />
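For determining <math display="inline">K</math> and <math display="inline">H</math>, use the following equations:<br />
<br />
<math display="inline">K = \log(4\sqrt{Q})</math><br />
<br />
<math display="inline">H = nQ \log(Q)</math><br />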
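<br />
As a small worked example, the two quantities and the resulting radius grid can be computed as follows. The base of the logarithm (we use base 2), the rounding up to integers, and the symmetric grid <math display="inline">2^{-k} R</math> for <math display="inline">k = -K, \ldots, K</math> are our assumptions for this sketch, as are the function names.<br />

```python
import math


def gld_fast_schedule(n, q):
    """Return (K, H) for dimension n and condition-number bound q, rounded up to integers."""
    k = max(1, math.ceil(math.log2(4.0 * math.sqrt(q))))
    h = max(1, math.ceil(n * q * math.log2(q)))
    return k, h


def radii(r_max, k):
    """Search radii 2^(-i) * r_max for i = -k, ..., k."""
    return [r_max * 2.0 ** (-i) for i in range(-k, k + 1)]


k, h = gld_fast_schedule(n=10, q=16.0)  # K = log2(16) = 4, H = 10 * 16 * 4 = 640
grid = radii(1.0, 2)                    # [4.0, 2.0, 1.0, 0.5, 0.25]
```

The number of radii tried per iteration is <math display="inline">2K + 1</math>, which depends only on <math display="inline">Q</math> and not on the target accuracy <math display="inline">\epsilon</math>.<br />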
<br />
==Results==<br />
<br />
We compare the GradientLess Descent algorithm to a benchmark established by the Augmented Randomised Search algorithm proposed in 2011.<br />
<br />
[[File:GLDBeatsARS.PNG|1000px|]]<br />
<br />
For this comparison, we defined the function <math display="inline">f(x) = \frac{1}{2} x^T H x </math> where <math display="inline">H</math> is a diagonal matrix with eigenvalues linearly interpolating the interval <math display="inline">[\alpha , \beta]</math>. We observe that in most scenarios, GradientLess Descent beats the benchmark.<br />
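<br />
The test function can be written down directly. The sketch below builds the diagonal of <math display="inline">H</math> by linear interpolation between <math display="inline">\alpha</math> and <math display="inline">\beta</math>; the function name and the chosen dimension are ours, for illustration only.<br />

```python
def make_quadratic(n, alpha, beta):
    """Return f(x) = 0.5 * x^T H x with diagonal H whose entries interpolate [alpha, beta]."""
    if n == 1:
        eigenvalues = [alpha]
    else:
        eigenvalues = [alpha + (beta - alpha) * i / (n - 1) for i in range(n)]

    def f(x):
        return 0.5 * sum(h * xi * xi for h, xi in zip(eigenvalues, x))

    return f, eigenvalues


f, eigenvalues = make_quadratic(3, alpha=1.0, beta=5.0)
value = f([1.0, 1.0, 1.0])                          # 0.5 * (1 + 3 + 5)
condition_number = eigenvalues[-1] / eigenvalues[0]  # beta / alpha
```

Such a function is <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth by construction, so its condition number can be controlled exactly in experiments.<br />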
<br />
==Conclusion==<br />
This research paper analyses a randomised algorithm in which a search direction is sampled from the standard Gaussian. It is a direct-search algorithm that yields a convergence rate depending only polylogarithmically on the dimensionality for any monotone transform of a smooth and strongly convex objective with a low-dimensional structure. The step size is chosen by an approximate line search over a grid of values that spans an interval with uniform spacing on a log scale. The authors show a geometric decrease of the function-value regret, up to a constant determined by the minimum step size, on strongly convex functions with Lipschitz-continuous gradients.<br />
<br />
==Critiques==<br />
<br />
1- Although the theoretical guarantees presented in the paper are interesting, it is not clear how this algorithm is applicable in practice. The paper assumes that we do not have explicit access to the objective function and can only use function evaluations. In addition, there is a strong assumption that the function is smooth and strongly convex. Given this, my main concern is: how can we make sure that the objective function is smooth and strongly convex when we do not have explicit access to it? (If we do have explicit access to the function and it is smooth and strongly convex, why not use gradient-based methods?) Further, what happens if the objective function violates the smoothness and strong convexity conditions? Can we still employ this algorithm?<br />
<br />
In response to the above comments:<br />
<br />
This algorithm has many practical applications especially in the field of reinforcement learning. A major concept in reinforcement learning is the concept of a reward function (which we either wish to minimise or maximise). In particular, the reward function may be hidden behind a black box. For example, consider a "theoretical" slot machine where we only see how much money we get if we win, but do not know how the amount is determined. It is true that in general, these objective functions may not be smooth or strongly convex, but one is able to either make certain assumptions about the reward function or relax certain conditions about the state of the world in order to create a reward function that is smooth or convex. Additionally, certain gradients may not have an analytical form, in which case numerical calculation for gradients may be computationally expensive. This method allows a way to bypass the gradient computations altogether!<br />
<br />
To back the response: <br />
They have demonstrated that their algorithm can be successfully applied to '''MuJoCo''' benchmarks, where the objective function is '''not''' strongly convex and smooth.<br />
- Providing more graphical illustrations alongside the proofs of the lemmas would make the paper easier to follow.<br />
<br />
2 - Currently, the paper experiments with only a single synthetic function (and under a single monotone exponential transformation). Experiments on additional functions would give more confidence that the algorithm works well in practice.<br />
<br />
==Bibliography==<br />
<br />
1. Daniel Golovin et al. "Gradientless descent: High-dimensional zeroth-order optimisation". In: arXiv preprint arXiv:1911.06317 (2019).<br />
<br />
2. Shengqiao Li. "Concise formulas for the area and volume of a hyperspherical cap". In: Asian Journal of Mathematics and Statistics 4.1 (2011), pp. 66-70.<br />
<br />
3. Yurii Nesterov and Vladimir Spokoiny. "Random gradient-free minimisation of convex functions". In: Foundations of Computational Mathematics 17.2 (2017), pp. 527-566.<br />
<br />
4. R Tyrrell Rockafellar. Convex analysis. 28. Princeton university press, 1970.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations&diff=49846ALBERT: A Lite BERT for Self-supervised Learning of Language Representations2020-12-10T01:39:24Z<p>A2chanan: </p>
<hr />
<div>== Presented by == <br />
Maziar Dadbin<br />
<br />
==Introduction==<br />
In this paper, the authors modify the BERT model to obtain ALBERT, a model that outperforms BERT on the GLUE, SQuAD, and RACE benchmarks. Notably, ALBERT has fewer parameters than BERT-large yet still produces better results. Two of the changes, factorized embedding parameterization and cross-layer parameter sharing, are parameter-reduction methods. The authors also introduce a new loss function that replaces one of the loss functions used in BERT (namely NSP). The last change is the removal of dropout from the model.<br />
<br />
== Motivation == <br />
In natural language representation learning, larger models often yield better performance. For example, the BERT paper reports that BERT_base and BERT_large improve average accuracy over the previous state of the art by 4.5% and 7% respectively, with BERT_large outperforming BERT_base on all tasks. However, at some point, GPU/TPU memory and training time constraints limit our ability to increase the model size any further. There have been attempts to reduce memory consumption, but at the cost of speed. For example, Chen et al. (2016)[1] use gradient checkpointing, which reduces memory requirements at the cost of an extra forward pass. Gomez et al. (2017)[2] reconstruct each layer's activations from those of the next layer, which removes the need to store them and frees up memory. In addition, Raffel et al. (2019)[3] use model parallelism to train a massive model. The authors of this paper claim that their parameter-reduction techniques both reduce memory consumption and increase training speed.<br />
<br />
==Model details==<br />
The fundamental structure of ALBERT is the same as that of BERT, i.e. it uses a transformer encoder with GELU nonlinearities. The authors set the feed-forward/filter size to 4*H and the number of attention heads to H/64 (where H is the size of the hidden layer). Next, we explain the changes that have been applied to BERT.<br />
<br />
<br />
===Factorized embedding parameterization===<br />
In BERT (as well as in subsequent models such as XLNet and RoBERTa), <math display="inline">E</math>=<math display="inline">H</math>, i.e. the size of the vocabulary embedding (<math display="inline">E</math>) and the size of the hidden layer (<math display="inline">H</math>) are tied together. This choice is inefficient because we may need a large hidden layer without needing a large vocabulary embedding. This is the case in many applications, because the vocabulary embedding is meant to learn context-independent representations, while the hidden-layer embedding is meant to learn context-dependent representations, which is usually harder. However, if we increase <math display="inline">H</math> and <math display="inline">E</math> together, the number of parameters grows enormously, because the vocabulary embedding matrix has size <math display="inline">V \cdot E</math>, where <math display="inline">V</math> is the size of the vocabulary and is usually quite large. For example, <math display="inline">V</math> equals 30000 in both BERT and ALBERT. <br />
The authors proposed the following solution to the problem:<br />
Instead of projecting one-hot vectors directly into the hidden space, first project them into a lower-dimensional space of size <math display="inline">E</math> and then project the result into the hidden space. This reduces the embedding parameters from <math display="inline">O(V \cdot H)</math> to <math display="inline">O(V \cdot E+E \cdot H)</math>, which is a significant reduction when <math display="inline">H</math> is much larger than <math display="inline">E</math>.<br />
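<br />
To make the saving concrete, here is a small worked calculation. The vocabulary size V = 30000 comes from the text above; the values H = 4096 and E = 128 are assumptions chosen for illustration, and the helper name <code>embedding_params</code> is hypothetical.<br />
<br />
```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters (hypothetical helper, biases ignored).

    With embedding_size=None the one-hot vectors are projected directly
    into the hidden space (V * H parameters). Otherwise they are first
    projected to size E and then to size H (V * E + E * H parameters).
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


V = 30000  # vocabulary size stated in the text
H = 4096   # assumed hidden size
E = 128    # assumed embedding size

tied = embedding_params(V, H)            # 122,880,000 parameters
factorized = embedding_params(V, H, E)   # 3,840,000 + 524,288 = 4,364,288
reduction = tied / factorized            # roughly a 28-fold reduction
```
<br />
Under these assumed sizes the factorization removes more than 118 million embedding parameters, and the saving grows with the gap between H and E.<br />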
<br />
===Cross-layer parameter sharing===<br />
Another method the authors used to reduce the number of parameters is to share parameters across layers. There are different strategies for parameter sharing. For example, one may share only the feed-forward network (FFN) parameters or only the attention parameters. However, the default choice for ALBERT is simply to share all parameters across layers.<br />
The following table shows the effect of different parameter-sharing strategies in two settings for the vocabulary embedding size. In both cases, sharing all the parameters has a negative effect on accuracy, and most of this effect comes from sharing the FFN parameters rather than the attention parameters. Despite this, the authors decided to share all the parameters across the layers. This results in a much smaller number of parameters, which in turn enables them to use larger hidden layers, and this is how they compensate for what is lost through parameter sharing. <br />
<br />
[[File:sharing.png | center |800px]]<br />
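<br />
The sharing strategies discussed above can be compared with a rough parameter count. This is a simplified sketch, not the authors' accounting: it assumes a standard encoder layer with four H-by-H attention projections and a two-layer FFN of size 4*H, includes biases, ignores layer normalization, and uses H = 768 with 12 layers purely as an illustrative configuration. The function names are hypothetical.<br />
<br />
```python
def encoder_layer_params(hidden_size):
    """Return (attention, ffn) parameter counts for one encoder layer."""
    ffn_size = 4 * hidden_size
    # Query, key, value and output projections, each H x H plus a bias.
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Two dense layers: H -> 4H and 4H -> H, each with a bias.
    ffn = (hidden_size * ffn_size + ffn_size) + (ffn_size * hidden_size + hidden_size)
    return attention, ffn


def total_encoder_params(hidden_size, num_layers, share_attention, share_ffn):
    """Count encoder parameters when some blocks are shared across layers."""
    attention, ffn = encoder_layer_params(hidden_size)
    attention_copies = 1 if share_attention else num_layers
    ffn_copies = 1 if share_ffn else num_layers
    return attention * attention_copies + ffn * ffn_copies


H, L = 768, 12  # assumed configuration
not_shared = total_encoder_params(H, L, share_attention=False, share_ffn=False)
shared_attention = total_encoder_params(H, L, share_attention=True, share_ffn=False)
shared_ffn = total_encoder_params(H, L, share_attention=False, share_ffn=True)
all_shared = total_encoder_params(H, L, share_attention=True, share_ffn=True)
```
<br />
In this simplified count the FFN holds about two thirds of each layer's parameters, so sharing the FFN saves more parameters than sharing attention, and sharing everything divides the encoder size by the number of layers.<br />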
<br />
<br />
'''Why does cross-layer parameter sharing work?'''<br />
From the experimental results, we can see that cross-layer parameter sharing dramatically reduces the model size without hurting accuracy too much. While it is obvious that sharing parameters reduces the model size, it is worth asking why parameters can be shared across BERT layers at all. Two of the authors briefly explained the reason in a blog post: they noticed that the network often learned to perform similar operations at various layers (Soricut, Lan, 2019). Previous research also showed that attention heads in BERT behave similarly (Clark et al., 2019). These observations suggest that the same weights can be used at different layers.<br />
<br />
===Inter-sentence coherence loss===<br />
<br />
BERT uses two loss functions, namely the masked language modeling (MLM) loss and the next-sentence prediction (NSP) loss. NSP is a binary classification loss in which positive examples are two consecutive segments from the training corpus and negative examples pair segments from different documents. Negative and positive examples are sampled with equal probability. However, experiments show that NSP is not effective, and the NSP loss overlaps with the MLM loss as far as topic prediction is concerned. In fact, the necessity of the NSP loss has been questioned in the literature (Lample and Conneau, 2019; Joshi et al., 2019). The authors explain the reason as follows:<br />
A negative example in NSP is misaligned in terms of both topic and coherence. However, topic prediction is easier to learn than coherence prediction, so the model ends up learning just the easier topic-prediction signal. For example, the model can easily learn that "I love cats" and "I had sushi for lunch" are not coherent because they differ in topic, but it might not be able to tell that "I love cats" and "my mom owned a dog" are not adjacent sentences.<br />
The authors address this problem by introducing a new loss, sentence order prediction (SOP), which is again a binary classification loss. Positive examples are the same as in NSP (two consecutive segments), but the negative examples are the same two consecutive segments with their order swapped. SOP forces the model to learn the harder coherence-prediction task. The following table compares NSP with SOP. As we can see, a model trained with NSP cannot solve the SOP task (at 52%, it performs close to random guessing), but a model trained with SOP can solve the NSP task to an acceptable degree (78.9%). We also see that, on average, SOP improves results on downstream tasks by almost 1%. Therefore, the authors decided to use MLM and SOP as the loss functions.<br />
<br />
<br />
<br />
[[File:SOPvsNSP.png | center |800px]]<br />
<br />
<br />
'''What does sentence order prediction (SOP) look like?'''<br />
<br />
'''Through a mathematical lens:'''<br />
<br />
First, we introduce some notation. <math display="inline">\vec{s_{j}}</math> is the <math display="inline">j^{th}</math> textual segment in a document <math display="inline"> D </math>. It consists of the words <math display="inline"> \vec{w^{j}_1}, ... , \vec{w^{j}_n} </math>, where <math display="inline"> \vec{w^{j}_i} </math> is the <math display="inline">i^{th}</math> word in <math display="inline">\vec{s_{j}}</math>. The model is shown an ordered pair of segments <math display="inline">(\vec{s_{k}}, \vec{s_{k+1}})</math>, where the subscripts <math display="inline">k</math> and <math display="inline">k+1</math> denote the order in which the segments are presented. The SOP task is to predict whether <math display="inline">\vec{s_{k+1}}</math> truly follows <math display="inline">\vec{s_{k}}</math> in the document, i.e. whether the pair is in its original order, <math display="inline">(\vec{s_{j}}, \vec{s_{j+1}})</math>, or has been swapped, <math display="inline">(\vec{s_{j+1}}, \vec{s_{j}})</math>.<br />
<br />
<br />
'''Through a visual lens:'''<br />
<br />
[[File:SOP.PNG | center | 800px]]<br />
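<br />
The difference between how NSP and SOP examples are built can be sketched as follows. This is a toy illustration based only on the description in this section, not the authors' data pipeline: the function names, the tiny corpus, and the convention that label 1 means "positive" are all assumptions. It requires at least two documents with at least two segments each.<br />
<br />
```python
import random

# Toy corpus: each document is a list of consecutive text segments.
documents = [
    ["I love cats.", "They sleep all day.", "Mine is called Tom."],
    ["I had sushi for lunch.", "It was delicious."],
]


def make_nsp_example(documents, rng):
    """Positive: two consecutive segments. Negative: second segment
    taken from a different document."""
    doc_index = rng.randrange(len(documents))
    doc = documents[doc_index]
    i = rng.randrange(len(doc) - 1)
    if rng.random() < 0.5:
        return doc[i], doc[i + 1], 1
    other_index = rng.choice(
        [j for j in range(len(documents)) if j != doc_index])
    return doc[i], rng.choice(documents[other_index]), 0


def make_sop_example(documents, rng):
    """Positive: two consecutive segments in their original order.
    Negative: the same two segments with their order swapped."""
    doc = rng.choice(documents)
    i = rng.randrange(len(doc) - 1)
    first, second = doc[i], doc[i + 1]
    if rng.random() < 0.5:
        return first, second, 1
    return second, first, 0


rng = random.Random(0)
nsp_example = make_nsp_example(documents, rng)
sop_example = make_sop_example(documents, rng)
```
<br />
Both segments of an SOP example always come from the same document, so topic alone cannot separate positives from negatives, whereas an NSP negative can often be detected from the topic change.<br />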
<br />
===Removing dropout===<br />
The last change the authors applied to BERT is the removal of dropout. Table 8 below shows the effect of removing dropout. The authors also observe that the model does not overfit the data even after 1M steps of training. They point out that empirical [8] and theoretical [9] evidence suggests that combining batch normalization with dropout may have harmful results, particularly in convolutional neural networks, and they speculate that dropout may be having a similar effect here.<br />
[[File:dropout.png | center |800px]]<br />
<br />
===Effect of Network Depth and Width===<br />
<br />
Table 11 shows the effect of increasing the number of layers. In all these settings the size of the hidden layers is 1024. Increasing the depth of the model gives better and better results until the number of layers reaches 24. However, increasing the depth from 24 to 48 appears to degrade the performance of the model.<br />
<br />
[[File:ALBERT_table11.png | center |800px]]<br />
<br />
Table 12 shows the effect of the width of the model. The accuracy of the model improves until the width of the network reaches 4096, after which any further increase in width appears to reduce the accuracy.<br />
[[File:ALBERT_table12.png | center |800px]]<br />
<br />
Table 13 investigates if we need a very deep model when the model is very wide. It seems that when we have H=4096, the difference between the performance of models with 12 or 24 layers is negligible. <br />
[[File:ALBERT_table13.png | center |800px]]<br />
<br />
These three tables illustrate the logic behind the authors' decisions about the width and depth of the model.<br />
== Source Code ==<br />
<br />
The official source code is available at: https://github.com/google-research/ALBERT<br />
==Conclusion==<br />
Looking at the following table, we can see that ALBERT-xxlarge outperforms BERT-large on all the downstream tasks. Note that ALBERT-xxlarge uses a larger configuration (yet has fewer parameters) than BERT-large, and as a result it is about 3 times slower.<br />
<br />
[[File:result.png | center |800px]]<br />
<br />
==Critiques==<br />
The authors mentioned that we usually get better results if we train our model for a longer time. Therefore, they present a comparison in which they trained both ALBERT-xxlarge and BERT-large for the same amount of time instead of the same number of steps. Here are the results:<br />
[[File:sameTime.png | center |800px]]<br />
<br />
However, in my opinion, it is not a fair comparison to train ALBERT-xxlarge for 125K steps and BERT-large for the 400K steps that fit into the same amount of time, because after some number of training steps, additional steps do not improve the result by much. It would be better to look at the results when BERT-large is trained for 125K steps and ALBERT-xxlarge is trained for the same amount of time. I would guess that in that case the result would be in favour of BERT-large. It would also be nice to have a plot with time on the horizontal axis and accuracy on the vertical axis. We would probably see that BERT-large is better at first but that at some point ALBERT-xxlarge starts to give the higher accuracy.<br />
<br />
This paper proposes an embedding factorization to reduce the number of embedding parameters, but the authors do not cite or compare against related approaches. This kind of parameter reduction has been explored with other techniques, for example knowledge distillation, quantization, and adaptive input/softmax.<br />
<br />
==References==<br />
[1]: Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.<br />
<br />
[2]: Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in neural information processing systems, pp. 2214–2224, 2017.<br />
<br />
[3]: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.<br />
<br />
[4]: Radu Soricut and Zhenzhong Lan. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. 2019. URL https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html<br />
<br />
[5]: Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning. What Does BERT Look At? An Analysis of BERT's Attention. 2019. URL https://arxiv.org/abs/1906.04341<br />
<br />
[6]: Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. 2019. URL https://arxiv.org/abs/1907.10529<br />
<br />
[7]: Guillaume Lample and Alexis Conneau. Crosslingual language model pretraining. 2019. URL https://arxiv.org/abs/1901.07291<br />
<br />
[8]: Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.<br />
<br />
[9]: Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the disharmony between dropout and batch normalization by variance shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2682–2690, 2019</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=43329stat940F212020-11-05T03:26:32Z<p>A2chanan: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [https://youtu.be/epBzlXHFNlY Presentation]<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || ||<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || ||<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A Fair Comparison of Graph Neural Networks for Graph Classification || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| EMPIRICAL STUDIES ON THE PROPERTIES OF LINEAR REGIONS IN DEEP NEURAL NETWORKS || [https://openreview.net/pdf?id=SkeFl1HKwr Paper] || ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Generalization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43320Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T10:18:21Z<p>A2chanan: /* Approach */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data is representative of the data encountered during deployment. They generally ignore the possibility of receiving slightly corrupted inputs, which makes the resulting models less robust and reduces their accuracy, since the models also try to fit the noise when making predictions. Even a small amount of corruption can severely reduce performance: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when some corruption is introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also provides evidence that networks trained with translation augmentations remain highly sensitive to shifts of pixels.<br />
The paper proposes AugMix, a method that achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The results are confirmed on the CIFAR-10, CIFAR-100, and ImageNet datasets. AugMix combines stochastic and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies simple augmentation operations. These operations are layered in chains to create a highly diverse set of augmented images, and the training loss includes a Jensen-Shannon divergence consistency term.<br />
<br />
[[File:augmix 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The method proposed by the authors can be divided into 3 major sections:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is created as shown in the figure above.<br />
2. Mixing: The resulting images from these augmentation chains are combined by mixing. The authors chose elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once these images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose weight is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence: The authors add a Jensen-Shannon divergence consistency term to the loss function. It is computed over p_orig, p_augmix1, and p_augmix2, the model's predicted distributions for the original image and for two AugMix-augmented versions of it.<br />
[[File:augmix 3.png|1000px|Image: 1000 pixels]]<br />
<br />
where KL denotes the Kullback-Leibler divergence, taken between each of p_orig, p_augmix1, p_augmix2 and their average.<br />
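The consistency term can be illustrated with a small sketch. It assumes the usual three-way Jensen-Shannon form, in which each predicted distribution is compared with the average of the three. The function names are illustrative and are not taken from the paper's code.<br />

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    # Terms with p_i == 0 contribute nothing to the sum.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p_orig, p_augmix1, p_augmix2):
    """Three-way Jensen-Shannon divergence (assumed form, see lead-in)."""
    distributions = [p_orig, p_augmix1, p_augmix2]
    # Mixture M is the elementwise average of the three distributions.
    mixture = [sum(values) / 3.0 for values in zip(*distributions)]
    return sum(kl_divergence(p, mixture) for p in distributions) / 3.0


# Identical predictions give (numerically) zero; disagreeing ones give a positive value.
same = js_divergence([0.2, 0.3, 0.5], [0.2, 0.3, 0.5], [0.2, 0.3, 0.5])
different = js_divergence([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])
```

The term is small when the model predicts similar class probabilities for the original image and its augmented versions, which is the consistency the loss is meant to encourage.<br />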
<br />
<br />
The pseudocode for the algorithm:<br />
<br />
[[File:augmix 2.png|1000px|Image: 1000 pixels]]<br />
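The augmentation and mixing steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: an image is a flat list of pixel values in [0, 1], the three toy operations stand in for the real augmentation set, and the defaults (k = 3 chains, α = 1, chain depth between 1 and 3) are assumptions chosen for the example.<br />

```python
import random


# Hypothetical toy operations standing in for the real augmentation set.
def invert(image):
    return [1.0 - p for p in image]


def darken(image):
    return [0.5 * p for p in image]


def flip(image):
    return list(reversed(image))


TOY_OPERATIONS = [invert, darken, flip]


def sample_dirichlet(alpha, k, rng):
    """Draw from Dirichlet(alpha, ..., alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def augmix(image, operations, k=3, alpha=1.0, max_depth=3, rng=None):
    """Mix k randomly built augmentation chains, then blend with the original."""
    rng = rng or random.Random()
    weights = sample_dirichlet(alpha, k, rng)
    mixed = [0.0] * len(image)
    for weight in weights:
        augmented = list(image)
        # Each chain composes a random number of randomly chosen operations.
        for _ in range(rng.randint(1, max_depth)):
            augmented = rng.choice(operations)(augmented)
        # Elementwise convex combination of the chain outputs.
        mixed = [m + weight * a for m, a in zip(mixed, augmented)]
    # "Skip connection": second convex combination with the original image.
    m = rng.betavariate(alpha, alpha)
    return [m * x + (1.0 - m) * y for x, y in zip(image, mixed)]


result = augmix([0.0, 0.25, 0.5, 1.0], TOY_OPERATIONS, rng=random.Random(0))
```

Because both mixing steps are convex combinations, the output stays within the pixel range of the chain outputs, so the mixed image remains a valid image.<br />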
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The authors use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by adding corruptions to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets. These datasets contain smaller perturbations than CIFAR-C and are used to measure the classifier’s prediction stability.<br />
The metric used to compare the models is the error rate. The clean error is the error rate on the uncorrupted datasets. The experiments use 15 corruption types, so the corruption error of a model is taken as the average of the error rates it achieves across these corruptions.<br />
In order to assess a model’s uncertainty estimates, the authors measure its miscalibration using the Brier score or the RMS calibration error.<br />
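The two kinds of metrics above can be made concrete with a short sketch. It assumes the corruption error is the plain average of per-corruption error rates, as described in the text, and uses the standard multi-class Brier score. The function names and the sample numbers are illustrative only.<br />

```python
def average_corruption_error(error_by_corruption):
    """Plain average of the error rates measured on each corruption type."""
    return sum(error_by_corruption.values()) / len(error_by_corruption)


def brier_score(probabilities, labels):
    """Mean squared difference between predicted probabilities and one-hot labels."""
    total = 0.0
    for probs, label in zip(probabilities, labels):
        total += sum(
            (p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(probs)
        )
    return total / len(labels)


# Made-up error rates for three corruption types, for illustration only.
example_errors = {"gaussian_noise": 0.2, "fog": 0.1, "jpeg": 0.3}
example_average = average_corruption_error(example_errors)

# A confident, correct model scores 0; an uninformative one scores higher.
perfect = brier_score([[1.0, 0.0], [0.0, 1.0]], [0, 1])
uniform = brier_score([[0.5, 0.5]], [0])
```

A lower Brier score indicates predicted probabilities that are closer to the true labels, which is why it serves as a measure of miscalibration.<br />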
<br />
<br />
===Results on CIFAR===<br />
<br />
For CIFAR datasets, 15 corruptions have been applied<br />
<br />
Setup: The authors use the following models for comparison:<br />
1. An All Convolutional Network<br />
2. A DenseNet-BC (k = 12, d = 100)<br />
3. A 40-2 Wide ResNet<br />
4. A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs for convergence. The weight decay is 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The authors further compare AugMix with other state-of-the-art data augmentation methods, as shown in the figure above. AugMix performs best, with 16.6% lower absolute corruption error. This comparison uses only ResNeXt on CIFAR-10-C.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===Results on ImageNet Dataset===<br />
<br />
<br />
[[File:imageNet 1.png|1000px|Image: 1000 pixels]]<br />
<br />
This shows Clean Error, Corruption Error (CE), and mCE values for various methods on ImageNet-C.<br />
The mCE value is computed by averaging across all 15 CE values. AUGMIX reduces corruption error<br />
while improving clean accuracy, and it can be combined with SIN for greater corruption robustness.<br />
<br />
== Conclusion ==<br />
AUGMIX is a data processing technique that mixes randomly generated augmentations and uses a Jensen-Shannon loss to enforce consistency. The simple-to-implement technique obtains<br />
state-of-the-art performance on CIFAR and ImageNet. AUGMIX seems to enable more reliable models, a necessity for models deployed in safety-critical environments. The models specified above perform better and are more tolerant of corruptions when trained with AugMix.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43319Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T10:17:25Z<p>A2chanan: /* Approach */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data is representative of the data encountered during deployment. They generally ignore the possibility of receiving slightly corrupted inputs, which makes the resulting models less robust and reduces their accuracy, since the models also try to fit the noise when making predictions. Even a small amount of corruption can severely reduce performance: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when some corruption is introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also provides evidence that networks trained with translation augmentations remain highly sensitive to shifts of pixels.<br />
The paper proposes AugMix, a method that achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The results are confirmed on the CIFAR-10, CIFAR-100, and ImageNet datasets. AugMix combines stochastic and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix does some basic augmentations techniques. These augmentations are often layered to create a high diversity of augmented images. The loss is calculated using the Jensen-Shannon divergence method.<br />
<br />
[[File:augmix 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The method proposed by the author can be divided into 3 major sections:<br />
1. Augmentations: The author uses basic data augmentation chains and the composition of data augmentation operations using AutoAugment. A chain is created like shown in the figure above<br />
2. Mixing: The resulting images from these augmentation chains are combined by mixing. The author chose to use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once these images are mixed, the author uses a “skip connection” to combine the result of the augmentation chain and the original image through a second random convex combination sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence: The authors use a Jensen-Shannon divergence consistency term in the loss function.<br />
<br />
[[File:augmix 3.png|1000px|Image: 1000 pixels]]<br />
<br />
where KL means KL Divergence between porig and paugmix<br />
<br />
<br />
The pseudocode for the algorithm:<br />
<br />
[[File:augmix 2.png|1000px|Image: 1000 pixels]]<br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The author used CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets which are constructed by adding corruption to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets. These datasets contain smaller perturbations than CIFAR-C and are used to measure the classifier’s prediction stability.<br />
The metrics used for comparison of the models is the error rate of the algorithm. The clean error is achieved by getting the error rates without applying any corruption of the datasets. In the experiment, the author uses 15 corruption techniques hence the error rate after corruption is taken as the average of all the error rates achieved by the specific model.<br />
In order to assess a model’s uncertainty estimates, the authors measure its miscalibration using the Brier score or the RMS calibration error.<br />
<br />
<br />
===Results on CIFAR===<br />
<br />
For CIFAR datasets, 15 corruptions have been applied<br />
<br />
Setup: The author has used three models for comparison:<br />
1.A DenseNet-BC (k = 12, d = 100)<br />
2.A 40-2 Wide ResNet<br />
3.A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, and the DenseNet and ResNeXt require 200 epochs for convergence and weight decay of 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The author has further compared it to other state-of-the-art algorithms used for data augmentation, which can be seen in the above figure. The AugMix algorithm performs the best with 16.6% lower absolute corruption error. This method only uses ResNeXt on CIFAR-10-C for comparison purposes.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===Results on ImageNet Dataset===<br />
<br />
<br />
[[File:imageNet 1.png|1000px|Image: 1000 pixels]]<br />
<br />
This shows Clean Error, Corruption Error (CE), and mCE values for various methods on ImageNet-C.<br />
The mCE value is computed by averaging across all 15 CE values. AUGMIX reduces corruption error<br />
while improving clean accuracy, and it can be combined with SIN for greater corruption robustness.<br />
<br />
== Conclusion ==<br />
AUGMIX is a data processing technique that mixes randomly generated augmentations and uses a Jensen-Shannon loss to enforce consistency. The simple-to-implement technique obtains<br />
state-of-the-art performance on CIFAR and ImageNet.AUGMIX seems to enable more reliable models, a necessity for models deployed in safety-critical environments. Using AugMix with the above-specified models performs better and tolerant of corruptions.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43318Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T10:15:19Z<p>A2chanan: /* Approach */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data is representative of the data encountered during deployment. They generally ignore the possibility of receiving slightly corrupted inputs, which makes the resulting models less robust and reduces their accuracy, since the models also try to fit the noise when making predictions. Even a small amount of corruption can severely reduce performance: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when some corruption is introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also provides evidence that networks trained with translation augmentations remain highly sensitive to shifts of pixels.<br />
The paper proposes AugMix, a method that achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The results are confirmed on the CIFAR-10, CIFAR-100, and ImageNet datasets. AugMix combines stochastic and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix does some basic augmentations techniques. These augmentations are often layered to create a high diversity of augmented images. The loss is calculated using the Jensen-Shannon divergence method.<br />
<br />
[[File:augmix 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The method proposed by the author can be divided into 3 major sections:<br />
1. Augmentations: The author uses basic data augmentation chains and the composition of data augmentation operations using AutoAugment. A chain is created like shown in the figure above<br />
2. Mixing: The resulting images from these augmentation chains are combined by mixing. The author chose to use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once these images are mixed, the author uses a “skip connection” to combine the result of the augmentation chain and the original image through a second random convex combination sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence: we minimize the Jensen-Shannon divergence among the posterior distributions of the original sample <math>x_{orig}</math> and its augmented variants. That is, for <math>p_{orig} = \hat{p}(y \mid x_{orig})</math>, <math>p_{augmix1} = \hat{p}(y \mid x_{augmix1})</math>, <math>p_{augmix2} = \hat{p}(y \mid x_{augmix2})</math>, we replace the original loss <math>\mathcal{L}</math> with the loss<br />
<math>\mathcal{L}(p_{orig}, y) + \lambda \, JS(p_{orig}; p_{augmix1}; p_{augmix2}).</math><br />
<br />
[[File:augmix 3.png|1000px|Image: 1000 pixels]]<br />
<br />
where KL means KL Divergence between porig and paugmix<br />
<br />
<br />
The pseudocode for the algorithm:<br />
<br />
[[File:augmix 2.png|1000px|Image: 1000 pixels]]<br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The author used CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets which are constructed by adding corruption to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets. These datasets contain smaller perturbations than CIFAR-C and are used to measure the classifier’s prediction stability.<br />
The metrics used for comparison of the models is the error rate of the algorithm. The clean error is achieved by getting the error rates without applying any corruption of the datasets. In the experiment, the author uses 15 corruption techniques hence the error rate after corruption is taken as the average of all the error rates achieved by the specific model.<br />
In order to assess a model’s uncertainty estimates, the authors measure its miscalibration using the Brier score or the RMS calibration error.<br />
<br />
<br />
===Results on CIFAR===<br />
<br />
For CIFAR datasets, 15 corruptions have been applied<br />
<br />
Setup: The author has used three models for comparison:<br />
1.A DenseNet-BC (k = 12, d = 100)<br />
2.A 40-2 Wide ResNet<br />
3.A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, and the DenseNet and ResNeXt require 200 epochs for convergence and weight decay of 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The author has further compared it to other state-of-the-art algorithms used for data augmentation, which can be seen in the above figure. The AugMix algorithm performs the best with 16.6% lower absolute corruption error. This method only uses ResNeXt on CIFAR-10-C for comparison purposes.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===Results on ImageNet Dataset===<br />
<br />
<br />
[[File:imageNet 1.png|1000px|Image: 1000 pixels]]<br />
<br />
This shows Clean Error, Corruption Error (CE), and mCE values for various methods on ImageNet-C.<br />
The mCE value is computed by averaging across all 15 CE values. AUGMIX reduces corruption error<br />
while improving clean accuracy, and it can be combined with SIN for greater corruption robustness.<br />
<br />
== Conclusion ==<br />
AUGMIX is a data processing technique that mixes randomly generated augmentations and uses a Jensen-Shannon loss to enforce consistency. The simple-to-implement technique obtains<br />
state-of-the-art performance on CIFAR and ImageNet.AUGMIX seems to enable more reliable models, a necessity for models deployed in safety-critical environments. Using AugMix with the above-specified models performs better and tolerant of corruptions.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43317Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T10:14:43Z<p>A2chanan: /* Approach */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data is representative of the data encountered during deployment. They generally ignore the possibility of receiving slightly corrupted inputs, which makes the resulting models less robust and reduces their accuracy, since the models also try to fit the noise when making predictions. Even a small amount of corruption can severely reduce performance: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when some corruption is introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also provides evidence that networks trained with translation augmentations remain highly sensitive to shifts of pixels.<br />
The paper proposes AugMix, a method that achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The results are confirmed on the CIFAR-10, CIFAR-100, and ImageNet datasets. AugMix combines stochastic and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix does some basic augmentations techniques. These augmentations are often layered to create a high diversity of augmented images. The loss is calculated using the Jensen-Shannon divergence method.<br />
<br />
[[File:augmix 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The method proposed by the author can be divided into 3 major sections:<br />
1. Augmentations: The author uses basic data augmentation chains and the composition of data augmentation operations using AutoAugment. A chain is created like shown in the figure above<br />
2. Mixing: The resulting images from these augmentation chains are combined by mixing. The author chose to use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once these images are mixed, the author uses a “skip connection” to combine the result of the augmentation chain and the original image through a second random convex combination sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence: we minimize the Jensen-Shannon divergence among the posterior distributions of the original sample <math>x_{orig}</math> and its augmented variants. That is, for <math>p_{orig} = \hat{p}(y \mid x_{orig})</math>, <math>p_{augmix1} = \hat{p}(y \mid x_{augmix1})</math>, <math>p_{augmix2} = \hat{p}(y \mid x_{augmix2})</math>, we replace the original loss <math>\mathcal{L}</math> with the loss<br />
<math>\mathcal{L}(p_{orig}, y) + \lambda \, JS(p_{orig}; p_{augmix1}; p_{augmix2}).</math><br />
<br />
[[File:augmix 3.png|1000px|Image: 1000 pixels]]<br />
<br />
<br />
The pseudocode for the algorithm:<br />
<br />
[[File:augmix 2.png|1000px|Image: 1000 pixels]]<br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The author used CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets which are constructed by adding corruption to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets. These datasets contain smaller perturbations than CIFAR-C and are used to measure the classifier’s prediction stability.<br />
The metrics used for comparison of the models is the error rate of the algorithm. The clean error is achieved by getting the error rates without applying any corruption of the datasets. In the experiment, the author uses 15 corruption techniques hence the error rate after corruption is taken as the average of all the error rates achieved by the specific model.<br />
In order to assess a model’s uncertainty estimates, the authors measure its miscalibration using the Brier score or the RMS calibration error.<br />
<br />
<br />
===Results on CIFAR===<br />
<br />
For CIFAR datasets, 15 corruptions have been applied<br />
<br />
Setup: The author has used three models for comparison:<br />
1.A DenseNet-BC (k = 12, d = 100)<br />
2.A 40-2 Wide ResNet<br />
3.A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, and the DenseNet and ResNeXt require 200 epochs for convergence and weight decay of 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The author has further compared it to other state-of-the-art algorithms used for data augmentation, which can be seen in the above figure. The AugMix algorithm performs the best with 16.6% lower absolute corruption error. This method only uses ResNeXt on CIFAR-10-C for comparison purposes.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===Results on ImageNet Dataset===<br />
<br />
<br />
[[File:imageNet 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The figure above shows the clean error, corruption error (CE), and mCE values for various methods on ImageNet-C.
The mCE value is computed by averaging across all 15 CE values. AugMix reduces corruption error
while improving clean accuracy, and it can be combined with SIN (Stylized ImageNet) for greater corruption robustness.<br />
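The averaging step described above can be sketched as follows. This assumes the plain, unweighted mean stated in the text; the three corruption names stand in for the full set of 15, and the error values are made-up placeholders rather than results from the paper.<br />

```python
def mean_corruption_error(corruption_errors):
    """Unweighted mean of per-corruption error rates (in percent)."""
    if not corruption_errors:
        raise ValueError("at least one corruption error is required")
    return sum(corruption_errors.values()) / len(corruption_errors)


# Placeholder values for illustration only; not taken from the paper.
toy_errors = {"gaussian_noise": 30.0, "motion_blur": 20.0, "fog": 10.0}
toy_mce = mean_corruption_error(toy_errors)
```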
<br />
== Conclusion ==<br />
AugMix is a data processing technique that mixes randomly generated augmentations and uses a Jensen-Shannon loss to enforce consistency. This simple-to-implement technique obtains
state-of-the-art performance on CIFAR and ImageNet. AugMix seems to enable more reliable models, a necessity for models deployed in safety-critical environments. Models trained with AugMix perform better and are more tolerant of corruptions.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:augmix_3.png&diff=43316File:augmix 3.png2020-11-04T10:14:16Z<p>A2chanan: </p>
<hr />
<div></div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43315Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T10:13:31Z<p>A2chanan: /* Approach */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data is representative of the data encountered during deployment. They generally ignore the possibility of receiving slightly corrupted inputs, which makes models less robust and reduces their accuracy, since the models also try to fit the noise when making predictions. Even a small amount of corruption can substantially degrade performance: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when some corruptions are introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also provides evidence that networks trained on translation augmentations remain highly sensitive to pixel shifts.<br />
The paper proposes a new algorithm, AugMix, which achieves state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The results are confirmed on the CIFAR-10, CIFAR-100, and ImageNet datasets. AugMix combines stochastic and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies several basic augmentation operations. These operations are layered into chains to produce a high diversity of augmented images, and a Jensen-Shannon divergence term is added to the loss to enforce consistent predictions.<br />
<br />
[[File:augmix 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The method proposed by the authors can be divided into three major components:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is created as shown in the figure above.<br />
2. Mixing: The images produced by these augmentation chains are combined by mixing. The authors use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once these images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose coefficient is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence: The authors minimize the Jensen-Shannon divergence among the posterior
distributions of the original sample <math>x_{orig}</math> and its augmented variants. That is, for <math>p_{orig} = \hat{p}(y \mid x_{orig})</math>,
<math>p_{augmix1} = \hat{p}(y \mid x_{augmix1})</math>, and <math>p_{augmix2} = \hat{p}(y \mid x_{augmix2})</math>, the original loss <math>\mathcal{L}</math> is replaced with the loss
<math>\mathcal{L}(p_{orig}, y) + \lambda \, \mathrm{JS}(p_{orig}; p_{augmix1}; p_{augmix2})</math>.
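<br />
A minimal sketch of this loss is given below. It assumes the usual three-way Jensen-Shannon divergence (the average KL divergence of each distribution from their mean) and cross-entropy as the original loss L. The helper names are illustrative, and `lam` is left as an explicit argument because the text above does not fix a value for λ.<br />

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def js_divergence(dists):
    """Average KL divergence of each distribution from their elementwise mean."""
    n = len(dists)
    mixture = [sum(d[k] for d in dists) / n for k in range(len(dists[0]))]
    return sum(kl_divergence(d, mixture) for d in dists) / n


def augmix_loss(p_orig, p_aug1, p_aug2, label, lam):
    """Cross-entropy on the clean prediction plus lam times the JS consistency term."""
    cross_entropy = -math.log(p_orig[label])
    return cross_entropy + lam * js_divergence([p_orig, p_aug1, p_aug2])
```

When the three predicted distributions agree, the consistency term vanishes and only the original loss remains, which is the behaviour the penalty is meant to encourage.<br />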
<br />
<br />
<br />
<br />
The pseudocode for the algorithm:<br />
<br />
[[File:augmix 2.png|1000px|Image: 1000 pixels]]<br />
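<br />
Because the pseudocode is only available as an image, the mixing step described in the Approach section can be sketched in plain Python as below. Images are represented as flat lists of floats in [0, 1], the three toy operations stand in for the real augmentation operations, and the defaults `k=3`, `alpha=1.0`, and `max_depth=3` are assumptions made for illustration rather than confirmed settings.<br />

```python
import random


def sample_dirichlet(k, alpha, rng):
    """Sample from Dirichlet(alpha, ..., alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def augmix(image, operations, k=3, alpha=1.0, max_depth=3, rng=None):
    """Mix k randomly composed augmentation chains, then blend with the original."""
    rng = rng or random.Random()
    weights = sample_dirichlet(k, alpha, rng)
    mixed = [0.0] * len(image)
    for w in weights:
        chain_out = list(image)
        # Each chain composes between 1 and max_depth randomly chosen operations.
        for _ in range(rng.randint(1, max_depth)):
            chain_out = rng.choice(operations)(chain_out)
        mixed = [m + w * c for m, c in zip(mixed, chain_out)]
    # "Skip connection": convex combination of the original and the mixed image.
    m = rng.betavariate(alpha, alpha)
    return [m * x + (1.0 - m) * a for x, a in zip(image, mixed)]


# Toy stand-ins for real augmentation operations (hypothetical, for illustration).
def invert(img):
    return [1.0 - v for v in img]


def brighten(img):
    return [min(1.0, v + 0.1) for v in img]


def flip(img):
    return list(reversed(img))


augmented = augmix([0.0, 0.5, 1.0], [invert, brighten, flip], rng=random.Random(0))
```

Since both mixing stages are convex combinations, the output stays in the same value range as the inputs as long as each operation does.<br />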
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The authors use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by applying corruptions to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets, but they contain smaller perturbations than the -C variants and are used to measure the stability of the classifier’s predictions.<br />
The metric used to compare the models is the error rate. The clean error is the error rate obtained on the uncorrupted dataset. Because the experiments use 15 corruption types, the corruption error of a model is reported as the average of its error rates over all 15 corruptions.<br />
To assess a model’s uncertainty estimates, the authors measure its miscalibration using the Brier score or the RMS calibration error.<br />
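To make the calibration measurement concrete, the RMS calibration error can be sketched as follows. This version assumes equal-width confidence bins, which is one common way to estimate it and not necessarily the binning used by the authors; `rms_calibration_error`, `num_bins`, and the toy inputs are illustrative names and values.<br />

```python
import math


def rms_calibration_error(confidences, correct, num_bins=10):
    """Root of the bin-weighted mean squared gap between accuracy and confidence."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    total = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1.0 for _, ok in members if ok) / len(members)
        total += (len(members) / n) * (accuracy - avg_conf) ** 2
    return math.sqrt(total)


# Toy example: the model is 90% confident but right only half of the time.
toy_error = rms_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, True, False])
```

A well-calibrated model has accuracy close to its stated confidence in every bin, which drives this quantity toward 0.<br />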
<br />
<br />
===Results on CIFAR===<br />
<br />
For the CIFAR datasets, 15 corruptions are applied.
<br />
Setup: The authors use the following models for comparison:<br />
1. An All Convolutional Network<br />
2. A DenseNet-BC (k = 12, d = 100)<br />
3. A 40-2 Wide ResNet<br />
4. A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs to converge. The weight decay is 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The authors further compare AugMix with other state-of-the-art data augmentation methods, as shown in the figure above. AugMix performs best, with a 16.6% lower absolute corruption error. This comparison uses only ResNeXt on CIFAR-10-C.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===Results on ImageNet Dataset===<br />
<br />
<br />
[[File:imageNet 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The figure above shows the clean error, corruption error (CE), and mCE values for various methods on ImageNet-C.
The mCE value is computed by averaging across all 15 CE values. AugMix reduces corruption error
while improving clean accuracy, and it can be combined with SIN (Stylized ImageNet) for greater corruption robustness.<br />
<br />
== Conclusion ==<br />
AugMix is a data processing technique that mixes randomly generated augmentations and uses a Jensen-Shannon loss to enforce consistency. This simple-to-implement technique obtains
state-of-the-art performance on CIFAR and ImageNet. AugMix seems to enable more reliable models, a necessity for models deployed in safety-critical environments. Models trained with AugMix perform better and are more tolerant of corruptions.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43314Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T10:12:04Z<p>A2chanan: /* Conclusion */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data is representative of the data encountered during deployment. They generally ignore the possibility of receiving slightly corrupted inputs, which makes models less robust and reduces their accuracy, since the models also try to fit the noise when making predictions. Even a small amount of corruption can substantially degrade performance: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when some corruptions are introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also provides evidence that networks trained on translation augmentations remain highly sensitive to pixel shifts.<br />
The paper proposes a new algorithm, AugMix, which achieves state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The results are confirmed on the CIFAR-10, CIFAR-100, and ImageNet datasets. AugMix combines stochastic and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies several basic augmentation operations. These operations are layered into chains to produce a high diversity of augmented images, and a Jensen-Shannon divergence term is added to the loss to enforce consistent predictions.<br />
<br />
[[File:augmix 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The method proposed by the authors can be divided into three major components:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is created as shown in the figure above.<br />
2. Mixing: The images produced by these augmentation chains are combined by mixing. The authors use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once these images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose coefficient is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence<br />
<br />
<br />
The pseudocode for the algorithm:<br />
<br />
[[File:augmix 2.png|1000px|Image: 1000 pixels]]<br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The authors use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by applying corruptions to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets, but they contain smaller perturbations than the -C variants and are used to measure the stability of the classifier’s predictions.<br />
The metric used to compare the models is the error rate. The clean error is the error rate obtained on the uncorrupted dataset. Because the experiments use 15 corruption types, the corruption error of a model is reported as the average of its error rates over all 15 corruptions.<br />
To assess a model’s uncertainty estimates, the authors measure its miscalibration using the Brier score or the RMS calibration error.<br />
<br />
<br />
===Results on CIFAR===<br />
<br />
For the CIFAR datasets, 15 corruptions are applied.
<br />
Setup: The authors use the following models for comparison:<br />
1. An All Convolutional Network<br />
2. A DenseNet-BC (k = 12, d = 100)<br />
3. A 40-2 Wide ResNet<br />
4. A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs to converge. The weight decay is 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The authors further compare AugMix with other state-of-the-art data augmentation methods, as shown in the figure above. AugMix performs best, with a 16.6% lower absolute corruption error. This comparison uses only ResNeXt on CIFAR-10-C.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===Results on ImageNet Dataset===<br />
<br />
<br />
[[File:imageNet 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The figure above shows the clean error, corruption error (CE), and mCE values for various methods on ImageNet-C.
The mCE value is computed by averaging across all 15 CE values. AugMix reduces corruption error
while improving clean accuracy, and it can be combined with SIN (Stylized ImageNet) for greater corruption robustness.<br />
<br />
== Conclusion ==<br />
AugMix is a data processing technique that mixes randomly generated augmentations and uses a Jensen-Shannon loss to enforce consistency. This simple-to-implement technique obtains
state-of-the-art performance on CIFAR and ImageNet. AugMix seems to enable more reliable models, a necessity for models deployed in safety-critical environments. Models trained with AugMix perform better and are more tolerant of corruptions.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43313Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T10:07:57Z<p>A2chanan: /* Approach */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data is representative of the data encountered during deployment. They generally ignore the possibility of receiving slightly corrupted inputs, which makes models less robust and reduces their accuracy, since the models also try to fit the noise when making predictions. Even a small amount of corruption can substantially degrade performance: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when some corruptions are introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also provides evidence that networks trained on translation augmentations remain highly sensitive to pixel shifts.<br />
The paper proposes a new algorithm, AugMix, which achieves state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The results are confirmed on the CIFAR-10, CIFAR-100, and ImageNet datasets. AugMix combines stochastic and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies several basic augmentation operations. These operations are layered into chains to produce a high diversity of augmented images, and a Jensen-Shannon divergence term is added to the loss to enforce consistent predictions.<br />
<br />
[[File:augmix 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The method proposed by the authors can be divided into three major components:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is created as shown in the figure above.<br />
2. Mixing: The images produced by these augmentation chains are combined by mixing. The authors use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once these images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose coefficient is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence<br />
<br />
<br />
The pseudocode for the algorithm:<br />
<br />
[[File:augmix 2.png|1000px|Image: 1000 pixels]]<br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The authors use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by applying corruptions to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets, but they contain smaller perturbations than the -C variants and are used to measure the stability of the classifier’s predictions.<br />
The metric used to compare the models is the error rate. The clean error is the error rate obtained on the uncorrupted dataset. Because the experiments use 15 corruption types, the corruption error of a model is reported as the average of its error rates over all 15 corruptions.<br />
To assess a model’s uncertainty estimates, the authors measure its miscalibration using the Brier score or the RMS calibration error.<br />
<br />
<br />
===Results on CIFAR===<br />
<br />
For the CIFAR datasets, 15 corruptions are applied.
<br />
Setup: The authors use the following models for comparison:<br />
1. An All Convolutional Network<br />
2. A DenseNet-BC (k = 12, d = 100)<br />
3. A 40-2 Wide ResNet<br />
4. A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs to converge. The weight decay is 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The authors further compare AugMix with other state-of-the-art data augmentation methods, as shown in the figure above. AugMix performs best, with a 16.6% lower absolute corruption error. This comparison uses only ResNeXt on CIFAR-10-C.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===Results on ImageNet Dataset===<br />
<br />
<br />
[[File:imageNet 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The figure above shows the clean error, corruption error (CE), and mCE values for various methods on ImageNet-C.
The mCE value is computed by averaging across all 15 CE values. AugMix reduces corruption error
while improving clean accuracy, and it can be combined with SIN (Stylized ImageNet) for greater corruption robustness.<br />
<br />
== Conclusion ==<br />
AugMix is a data processing technique which mixes randomly generated augmentations and
uses a Jensen-Shannon loss to enforce consistency. This simple-to-implement technique obtains
state-of-the-art performance on CIFAR-10/100-C, ImageNet-C, CIFAR-10/100-P, and ImageNet-P.
AugMix models achieve state-of-the-art calibration and can maintain calibration even as the
distribution shifts. The authors hope that AugMix will enable more reliable models, a necessity for models
deployed in safety-critical environments.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:augmix_2.png&diff=43312File:augmix 2.png2020-11-04T10:07:13Z<p>A2chanan: </p>
<hr />
<div></div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43311Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T10:05:59Z<p>A2chanan: /* Approach */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data is representative of the data encountered during deployment. They generally ignore the possibility of receiving slightly corrupted inputs, which makes models less robust and reduces their accuracy, since the models also try to fit the noise when making predictions. Even a small amount of corruption can substantially degrade performance: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when some corruptions are introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also provides evidence that networks trained on translation augmentations remain highly sensitive to pixel shifts.<br />
The paper proposes a new algorithm, AugMix, which achieves state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The results are confirmed on the CIFAR-10, CIFAR-100, and ImageNet datasets. AugMix combines stochastic and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies several basic augmentation operations. These operations are layered into chains to produce a high diversity of augmented images, and a Jensen-Shannon divergence term is added to the loss to enforce consistent predictions.<br />
<br />
[[File:augmix 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The method proposed by the authors can be divided into three major components:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is created as shown in the figure above.<br />
2. Mixing: The images produced by these augmentation chains are combined by mixing. The authors use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once these images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose coefficient is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence<br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The authors use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by applying corruptions to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets, but they contain smaller perturbations than the -C variants and are used to measure the stability of the classifier’s predictions.<br />
The metric used to compare the models is the error rate. The clean error is the error rate obtained on the uncorrupted dataset. Because the experiments use 15 corruption types, the corruption error of a model is reported as the average of its error rates over all 15 corruptions.<br />
To assess a model’s uncertainty estimates, the authors measure its miscalibration using the Brier score or the RMS calibration error.<br />
<br />
<br />
===Results on CIFAR===<br />
<br />
For the CIFAR datasets, 15 corruptions are applied.
<br />
Setup: The authors use the following models for comparison:<br />
1. An All Convolutional Network<br />
2. A DenseNet-BC (k = 12, d = 100)<br />
3. A 40-2 Wide ResNet<br />
4. A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs to converge. The weight decay is 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The authors further compare AugMix with other state-of-the-art data augmentation methods, as shown in the figure above. AugMix performs best, with a 16.6% lower absolute corruption error. This comparison uses only ResNeXt on CIFAR-10-C.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===Results on ImageNet Dataset===<br />
<br />
<br />
[[File:imageNet 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The figure above shows the clean error, corruption error (CE), and mCE values for various methods on ImageNet-C.
The mCE value is computed by averaging across all 15 CE values. AugMix reduces corruption error
while improving clean accuracy, and it can be combined with SIN (Stylized ImageNet) for greater corruption robustness.<br />
<br />
== Conclusion ==<br />
AugMix is a data processing technique which mixes randomly generated augmentations and
uses a Jensen-Shannon loss to enforce consistency. This simple-to-implement technique obtains
state-of-the-art performance on CIFAR-10/100-C, ImageNet-C, CIFAR-10/100-P, and ImageNet-P.
AugMix models achieve state-of-the-art calibration and can maintain calibration even as the
distribution shifts. The authors hope that AugMix will enable more reliable models, a necessity for models
deployed in safety-critical environments.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:augmix_1.png&diff=43310File:augmix 1.png2020-11-04T10:05:16Z<p>A2chanan: </p>
<hr />
<div></div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43309Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T10:04:23Z<p>A2chanan: /* RESUTLS ON CIFAR DATASET */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data is representative of the data encountered during deployment. They generally ignore the possibility of receiving slightly corrupted inputs, which makes models less robust and reduces their accuracy, since the models also try to fit the noise when making predictions. Even a small amount of corruption can substantially degrade performance: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when some corruptions are introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also provides evidence that networks trained on translation augmentations remain highly sensitive to pixel shifts.<br />
The paper proposes a new algorithm, AugMix, which achieves state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The results are confirmed on the CIFAR-10, CIFAR-100, and ImageNet datasets. AugMix combines stochastic and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies several basic augmentation operations. These operations are layered into chains to produce a high diversity of augmented images, and a Jensen-Shannon divergence term is added to the loss to enforce consistent predictions.<br />
The method proposed by the authors can be divided into three major components:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is created as shown in the figure above.<br />
2. Mixing: The images produced by these augmentation chains are combined by mixing. The authors use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once these images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose coefficient is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence<br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The authors use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by applying corruptions to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets, but they contain smaller perturbations than the -C variants and are used to measure the stability of the classifier’s predictions.<br />
The metric used to compare the models is the error rate. The clean error is the error rate obtained on the uncorrupted dataset. Because the experiments use 15 corruption types, the corruption error of a model is reported as the average of its error rates over all 15 corruptions.<br />
To assess a model’s uncertainty estimates, the authors measure its miscalibration using the Brier score or the RMS calibration error.<br />
<br />
<br />
===Results on CIFAR===<br />
<br />
For the CIFAR datasets, 15 corruptions are applied.
<br />
Setup: The authors use the following models for comparison:<br />
1. An All Convolutional Network<br />
2. A DenseNet-BC (k = 12, d = 100)<br />
3. A 40-2 Wide ResNet<br />
4. A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs to converge. The weight decay is 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The authors further compare AugMix with other state-of-the-art data augmentation methods, as shown in the figure above. AugMix performs best, with a 16.6% lower absolute corruption error. This comparison uses only ResNeXt on CIFAR-10-C.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===Results on ImageNet Dataset===<br />
<br />
<br />
[[File:imageNet 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The figure above shows the clean error, corruption error (CE), and mCE values for various methods on ImageNet-C.
The mCE value is computed by averaging across all 15 CE values. AugMix reduces corruption error
while improving clean accuracy, and it can be combined with SIN (Stylized ImageNet) for greater corruption robustness.<br />
<br />
== Conclusion ==<br />
AugMix is a data processing technique which mixes randomly generated augmentations and
uses a Jensen-Shannon loss to enforce consistency. This simple-to-implement technique obtains
state-of-the-art performance on CIFAR-10/100-C, ImageNet-C, CIFAR-10/100-P, and ImageNet-P.
AugMix models achieve state-of-the-art calibration and can maintain calibration even as the
distribution shifts. The authors hope that AugMix will enable more reliable models, a necessity for models
deployed in safety-critical environments.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43308Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T10:04:05Z<p>A2chanan: /* RESULTS ON ImageNet DATASET */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data correctly represents the data encountered during deployment. They generally ignore the possibility of even slight corruption, which makes models less robust and less accurate because they also try to fit the noise. Even a small amount of corruption can substantially reduce performance: Hendrycks & Dietterich (2019) show that classification error rises from 25% to 62% when some corruptions are introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to new ones. The paper also provides evidence that networks trained with translation augmentations remain highly sensitive to pixel shifts.<br />
The paper proposes AugMix, a method that achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The results are confirmed on the CIFAR-10, CIFAR-100, and ImageNet datasets. AugMix uses stochastic and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies a set of simple augmentation operations. These operations are layered into chains to create a highly diverse set of augmented images. The loss is computed with a Jensen-Shannon divergence consistency term.<br />
The method proposed by the authors can be divided into 3 major parts:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is created as shown in the figure above.<br />
2. Mixing: The images produced by these augmentation chains are combined by mixing. The authors chose elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, ..., α) distribution. Once these images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose weight is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence: A Jensen-Shannon divergence consistency loss encourages the model to make consistent predictions for the original image and its augmented versions.<br />
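The augmentation and mixing steps above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: images are assumed to be flat lists of pixel values in [0, 1], the `operations` argument stands in for the AutoAugment operations, and the defaults `k=3`, `alpha=1.0`, and `max_depth=3` are assumptions chosen for the example.<br />

```python
import random


def dirichlet(alpha, k, rng):
    """Sample k convex weights from a symmetric Dirichlet(alpha, ..., alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def augmix(image, operations, k=3, alpha=1.0, max_depth=3, rng=None):
    """Mix k random augmentation chains, then blend with the original image.

    image: flat list of pixel values (assumed representation).
    operations: list of functions mapping such a list to a new list.
    """
    rng = rng or random.Random()
    weights = dirichlet(alpha, k, rng)
    mixed = [0.0] * len(image)
    for w in weights:
        augmented = list(image)
        # Build one chain by composing a random number of random operations.
        for _ in range(rng.randint(1, max_depth)):
            augmented = rng.choice(operations)(augmented)
        # Elementwise convex combination of the chain outputs.
        mixed = [m + w * a for m, a in zip(mixed, augmented)]
    # "Skip connection": second convex combination with the original image.
    m = rng.betavariate(alpha, alpha)
    return [m * x + (1.0 - m) * y for x, y in zip(image, mixed)]
```

Because both combinations are convex, the output stays within the pixel range of its inputs as long as each operation does.<br />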
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The authors use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by adding corruptions to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets. These datasets contain smaller perturbations than CIFAR-C and are used to measure the classifier’s prediction stability.<br />
The metric used to compare the models is the error rate. The clean error is the error rate obtained without applying any corruption to the datasets. The experiments use 15 corruption types, so the corruption error of a model is the average of its error rates over all 15 corruptions.<br />
To assess a model’s uncertainty estimates, its miscalibration is measured. The authors use the Brier Score or the RMS Calibration Error for this purpose.<br />
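The two calibration measures named above can be sketched as follows. This is a hedged illustration rather than the paper's evaluation code: the Brier score is written in its standard multi-class form, and the RMS calibration error is computed with equal-width confidence bins, which is an assumption made here for simplicity (the binning scheme is not specified in this summary).<br />

```python
import math


def brier_score(probs, labels):
    """Mean squared distance between predicted probabilities and one-hot labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pc - (1.0 if c == y else 0.0)) ** 2 for c, pc in enumerate(p))
    return total / len(probs)


def rms_calibration_error(probs, labels, n_bins=10):
    """Root-mean-square gap between confidence and accuracy over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)  # equal-width bins (assumption)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    n = len(probs)
    squared_gap = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(a for _, a in b) / len(b)
            squared_gap += (len(b) / n) * (accuracy - avg_conf) ** 2
    return math.sqrt(squared_gap)
```

A perfectly confident and correct classifier scores 0 on both measures; lower is better in each case.<br />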
<br />
<br />
===RESULTS ON CIFAR DATASET===<br />
<br />
For the CIFAR datasets, 15 corruptions are applied.<br />
<br />
Setup: The authors compare four architectures:<br />
1. An All Convolutional Network<br />
2. A DenseNet-BC (k = 12, d = 100)<br />
3. A 40-2 Wide ResNet<br />
4. A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs to converge. Weight decay is 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The authors also compare AugMix with other state-of-the-art data augmentation methods, as shown in the figure above. AugMix performs best, with a 16.6% lower absolute corruption error. This comparison uses only ResNeXt on CIFAR-10-C.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===Results on ImageNet Dataset===<br />
<br />
<br />
[[File:imageNet 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The figure above shows Clean Error, Corruption Error (CE), and mCE values for various methods on ImageNet-C.<br />
The mCE value is computed by averaging across all 15 CE values. AUGMIX reduces corruption error<br />
while improving clean accuracy, and it can be combined with SIN (Stylized ImageNet) for greater corruption robustness.<br />
<br />
== Conclusion ==<br />
AUGMIX is a data processing technique which mixes randomly generated augmentations and<br />
uses a Jensen-Shannon loss to enforce consistency. This simple-to-implement technique obtains<br />
state-of-the-art performance on CIFAR-10/100-C, ImageNet-C, CIFAR-10/100-P, and ImageNet-P.<br />
AUGMIX models achieve state-of-the-art calibration and can maintain calibration even as the<br />
distribution shifts. The authors hope that AUGMIX will enable more reliable models, a necessity for models<br />
deployed in safety-critical environments.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43307Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T10:03:44Z<p>A2chanan: /* RESULTS ON ImageNet DATASET */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data correctly represents the data encountered during deployment. They generally ignore the possibility of even slight corruption, which makes models less robust and less accurate because they also try to fit the noise. Even a small amount of corruption can substantially reduce performance: Hendrycks & Dietterich (2019) show that classification error rises from 25% to 62% when some corruptions are introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to new ones. The paper also provides evidence that networks trained with translation augmentations remain highly sensitive to pixel shifts.<br />
The paper proposes AugMix, a method that achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The results are confirmed on the CIFAR-10, CIFAR-100, and ImageNet datasets. AugMix uses stochastic and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies a set of simple augmentation operations. These operations are layered into chains to create a highly diverse set of augmented images. The loss is computed with a Jensen-Shannon divergence consistency term.<br />
The method proposed by the authors can be divided into 3 major parts:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is created as shown in the figure above.<br />
2. Mixing: The images produced by these augmentation chains are combined by mixing. The authors chose elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, ..., α) distribution. Once these images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose weight is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence: A Jensen-Shannon divergence consistency loss encourages the model to make consistent predictions for the original image and its augmented versions.<br />
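The Jensen-Shannon divergence in the third step can be sketched as below. This uses the standard definition for several distributions (the average KL divergence of each distribution from their mean). Applying it to the predicted class probabilities of the original image and two augmented copies is an assumption made for this illustration, as are the function names.<br />

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(*distributions):
    """Average KL divergence of each distribution from their elementwise mean."""
    n = len(distributions)
    mixture = [sum(column) / n for column in zip(*distributions)]
    return sum(kl_divergence(p, mixture) for p in distributions) / n


# Hypothetical predictions for an original image and two augmented copies.
p_orig = [0.7, 0.2, 0.1]
p_aug1 = [0.6, 0.3, 0.1]
p_aug2 = [0.5, 0.3, 0.2]
consistency_penalty = js_divergence(p_orig, p_aug1, p_aug2)
```

The value is 0 when all predictions agree and is bounded above by log(n) for n distributions, so it gives a stable consistency penalty that can be added to the classification loss.<br />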
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The authors use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by adding corruptions to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets. These datasets contain smaller perturbations than CIFAR-C and are used to measure the classifier’s prediction stability.<br />
The metric used to compare the models is the error rate. The clean error is the error rate obtained without applying any corruption to the datasets. The experiments use 15 corruption types, so the corruption error of a model is the average of its error rates over all 15 corruptions.<br />
To assess a model’s uncertainty estimates, its miscalibration is measured. The authors use the Brier Score or the RMS Calibration Error for this purpose.<br />
<br />
<br />
===RESULTS ON CIFAR DATASET===<br />
<br />
For the CIFAR datasets, 15 corruptions are applied.<br />
<br />
Setup: The authors compare four architectures:<br />
1. An All Convolutional Network<br />
2. A DenseNet-BC (k = 12, d = 100)<br />
3. A 40-2 Wide ResNet<br />
4. A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs to converge. Weight decay is 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The authors also compare AugMix with other state-of-the-art data augmentation methods, as shown in the figure above. AugMix performs best, with a 16.6% lower absolute corruption error. This comparison uses only ResNeXt on CIFAR-10-C.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===RESULTS ON ImageNet DATASET===<br />
<br />
<br />
[[File:imageNet 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The figure above shows Clean Error, Corruption Error (CE), and mCE values for various methods on ImageNet-C.<br />
The mCE value is computed by averaging across all 15 CE values. AUGMIX reduces corruption error<br />
while improving clean accuracy, and it can be combined with SIN (Stylized ImageNet) for greater corruption robustness.<br />
<br />
== Conclusion ==<br />
AUGMIX is a data processing technique which mixes randomly generated augmentations and<br />
uses a Jensen-Shannon loss to enforce consistency. This simple-to-implement technique obtains<br />
state-of-the-art performance on CIFAR-10/100-C, ImageNet-C, CIFAR-10/100-P, and ImageNet-P.<br />
AUGMIX models achieve state-of-the-art calibration and can maintain calibration even as the<br />
distribution shifts. The authors hope that AUGMIX will enable more reliable models, a necessity for models<br />
deployed in safety-critical environments.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:imageNet_1.png&diff=43306File:imageNet 1.png2020-11-04T10:02:12Z<p>A2chanan: </p>
<hr />
<div></div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=43305stat940F212020-11-04T09:57:13Z<p>A2chanan: </p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Processing Method to Improve Robustness and Uncertainty || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] ||<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || ||<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || ||<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A Fair Comparison of Graph Neural Networks for Graph Classification || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| EMPIRICAL STUDIES ON THE PROPERTIES OF LINEAR REGIONS IN DEEP NEURAL NETWORKS || [https://openreview.net/pdf?id=SkeFl1HKwr Paper] || ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Generalization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43304Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T09:55:21Z<p>A2chanan: /* Conclusion */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data correctly represents the data encountered during deployment. They generally ignore the possibility of even slight corruption, which makes models less robust and less accurate because they also try to fit the noise. Even a small amount of corruption can substantially reduce performance: Hendrycks & Dietterich (2019) show that classification error rises from 25% to 62% when some corruptions are introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to new ones. The paper also provides evidence that networks trained with translation augmentations remain highly sensitive to pixel shifts.<br />
The paper proposes AugMix, a method that achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The results are confirmed on the CIFAR-10, CIFAR-100, and ImageNet datasets. AugMix uses stochastic and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies a set of simple augmentation operations. These operations are layered into chains to create a highly diverse set of augmented images. The loss is computed with a Jensen-Shannon divergence consistency term.<br />
The method proposed by the authors can be divided into 3 major parts:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is created as shown in the figure above.<br />
2. Mixing: The images produced by these augmentation chains are combined by mixing. The authors chose elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, ..., α) distribution. Once these images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose weight is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence: A Jensen-Shannon divergence consistency loss encourages the model to make consistent predictions for the original image and its augmented versions.<br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The authors use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by adding corruptions to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets. These datasets contain smaller perturbations than CIFAR-C and are used to measure the classifier’s prediction stability.<br />
The metric used to compare the models is the error rate. The clean error is the error rate obtained without applying any corruption to the datasets. The experiments use 15 corruption types, so the corruption error of a model is the average of its error rates over all 15 corruptions.<br />
To assess a model’s uncertainty estimates, its miscalibration is measured. The authors use the Brier Score or the RMS Calibration Error for this purpose.<br />
<br />
<br />
===RESULTS ON CIFAR DATASET===<br />
<br />
For the CIFAR datasets, 15 corruptions are applied.<br />
<br />
Setup: The authors compare four architectures:<br />
1. An All Convolutional Network<br />
2. A DenseNet-BC (k = 12, d = 100)<br />
3. A 40-2 Wide ResNet<br />
4. A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs to converge. Weight decay is 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The authors also compare AugMix with other state-of-the-art data augmentation methods, as shown in the figure above. AugMix performs best, with a 16.6% lower absolute corruption error. This comparison uses only ResNeXt on CIFAR-10-C.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===RESULTS ON ImageNet DATASET===<br />
<br />
== Conclusion ==<br />
AUGMIX is a data processing technique which mixes randomly generated augmentations and<br />
uses a Jensen-Shannon loss to enforce consistency. This simple-to-implement technique obtains<br />
state-of-the-art performance on CIFAR-10/100-C, ImageNet-C, CIFAR-10/100-P, and ImageNet-P.<br />
AUGMIX models achieve state-of-the-art calibration and can maintain calibration even as the<br />
distribution shifts. The authors hope that AUGMIX will enable more reliable models, a necessity for models<br />
deployed in safety-critical environments.</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43303Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T09:54:38Z<p>A2chanan: /* RESUTLS ON ImageNet DATASET== */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data correctly represents the data encountered during deployment. They generally ignore the possibility of even slight corruption, which makes models less robust and less accurate because they also try to fit the noise. Even a small amount of corruption can substantially reduce performance: Hendrycks & Dietterich (2019) show that classification error rises from 25% to 62% when some corruptions are introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to new ones. The paper also provides evidence that networks trained with translation augmentations remain highly sensitive to pixel shifts.<br />
The paper proposes AugMix, a method that achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The results are confirmed on the CIFAR-10, CIFAR-100, and ImageNet datasets. AugMix uses stochastic and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies a set of simple augmentation operations. These operations are layered into chains to create a highly diverse set of augmented images. The loss is computed with a Jensen-Shannon divergence consistency term.<br />
The method proposed by the authors can be divided into 3 major parts:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is created as shown in the figure above.<br />
2. Mixing: The images produced by these augmentation chains are combined by mixing. The authors chose elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, ..., α) distribution. Once these images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose weight is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence: A Jensen-Shannon divergence consistency loss encourages the model to make consistent predictions for the original image and its augmented versions.<br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The authors use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by adding corruptions to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets. These datasets contain smaller perturbations than CIFAR-C and are used to measure the classifier’s prediction stability.<br />
The metric used to compare the models is the error rate. The clean error is the error rate obtained without applying any corruption to the datasets. The experiments use 15 corruption types, so the corruption error of a model is the average of its error rates over all 15 corruptions.<br />
To assess a model’s uncertainty estimates, its miscalibration is measured. The authors use the Brier Score or the RMS Calibration Error for this purpose.<br />
<br />
<br />
===RESULTS ON CIFAR DATASET===<br />
<br />
For the CIFAR datasets, 15 corruptions are applied.<br />
<br />
Setup: The authors compare four architectures:<br />
1. An All Convolutional Network<br />
2. A DenseNet-BC (k = 12, d = 100)<br />
3. A 40-2 Wide ResNet<br />
4. A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs to converge. Weight decay is 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The authors also compare AugMix with other state-of-the-art data augmentation methods, as shown in the figure above. AugMix performs best, with a 16.6% lower absolute corruption error. This comparison uses only ResNeXt on CIFAR-10-C.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===RESULTS ON ImageNet DATASET===<br />
<br />
== Conclusion ==</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43302Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T09:53:57Z<p>A2chanan: /* RESUTLS ON CIFAR DATASET */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data accurately represents the data encountered during deployment. They generally ignore the possibility of even slightly corrupted inputs, which makes models less robust and reduces accuracy, because the models also try to fit the noise when making predictions. Even a small amount of corruption can substantially degrade the performance of various models: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when corruptions are introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also cites evidence that networks trained on translation augmentations remain highly sensitive to pixel shifts.<br />
The paper proposes a new technique, AugMix, which achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The paper uses the CIFAR-10, CIFAR-100, and ImageNet datasets to confirm the results. AugMix utilizes stochasticity and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies several basic augmentation operations. These operations are layered in chains to create a highly diverse set of augmented images. The loss is computed using the Jensen-Shannon divergence.<br />
The method proposed by the authors can be divided into three major components:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is constructed as shown in the figure above.<br />
2. Mixing: The images produced by these augmentation chains are combined by mixing. The authors use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once the images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose weight is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence<br />
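<br />
The mixing step and the Jensen-Shannon term listed above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the names <code>augmix</code>, <code>js_consistency</code>, and <code>TOY_OPS</code> are hypothetical, the toy operations stand in for the real augmentation operations, the defaults (three chains of one to three operations, α = 1) are assumptions made for the example, and the three-way form of the divergence (original image plus two augmented views) is assumed from the paper's description.<br />

```python
import numpy as np

# Hypothetical stand-ins for real augmentation operations; each keeps values in [0, 1].
TOY_OPS = [
    lambda x: x[:, ::-1],             # horizontal flip
    lambda x: np.roll(x, 2, axis=0),  # vertical shift with wrap-around
    lambda x: 1.0 - x,                # intensity inversion
]


def augmix(image, operations, rng, width=3, depth=3, alpha=1.0):
    """Mix `width` augmentation chains, then blend the result with the original."""
    image = np.asarray(image, dtype=np.float64)
    # Convex coefficients over the chains: w ~ Dirichlet(alpha, ..., alpha).
    w = rng.dirichlet([alpha] * width)
    # Weight of the "skip connection": m ~ Beta(alpha, alpha).
    m = rng.beta(alpha, alpha)
    mixed = np.zeros_like(image)
    for i in range(width):
        chain_out = image
        # Each chain composes between 1 and `depth` randomly chosen operations.
        for _ in range(int(rng.integers(1, depth + 1))):
            op = operations[int(rng.integers(len(operations)))]
            chain_out = op(chain_out)
        mixed += w[i] * chain_out
    # Second convex combination: original image versus the mixed chains.
    return m * image + (1.0 - m) * mixed


def js_consistency(p_orig, p_aug1, p_aug2, eps=1e-12):
    """Jensen-Shannon divergence among three predicted class distributions."""
    mean = (p_orig + p_aug1 + p_aug2) / 3.0

    def kl(p, q):
        return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

    return (kl(p_orig, mean) + kl(p_aug1, mean) + kl(p_aug2, mean)) / 3.0
```

In such a setup, <code>augmix</code> would be applied twice to each training image to obtain two augmented views, and <code>js_consistency</code> would be evaluated on the model's predicted class probabilities for the original image and the two views.<br />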
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The authors use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by applying corruptions to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets; they contain smaller perturbations than the -C datasets and are used to measure the stability of the classifier’s predictions.<br />
The metric used to compare the models is the error rate. The clean error is the error rate on the uncorrupted dataset. The experiments use 15 corruption types, so the corruption error of a model is reported as the average of its error rates over these corruptions.<br />
To assess a model’s uncertainty estimates, its miscalibration is measured using the Brier Score or the RMS Calibration Error.<br />
<br />
<br />
===RESULTS ON CIFAR DATASET===<br />
<br />
For the CIFAR datasets, 15 corruptions are applied.<br />
<br />
Setup: The authors use the following models for comparison:<br />
1. An All Convolutional Network
2. A DenseNet-BC (k = 12, d = 100)
3. A 40-2 Wide ResNet
4. A ResNeXt-29
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs for convergence. The weight decay is 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The authors further compare AugMix with other state-of-the-art data augmentation methods, as shown in the figure above. AugMix performs best, with a 16.6% lower absolute corruption error. This comparison uses only ResNeXt on CIFAR-10-C.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===RESULTS ON ImageNet DATASET===<br />
<br />
== Conclusion ==</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:CIFAR_2.png&diff=43301File:CIFAR 2.png2020-11-04T09:52:00Z<p>A2chanan: </p>
<hr />
<div></div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43300Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T09:48:54Z<p>A2chanan: /* RESUTLS ON CIFAR DATASET */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data accurately represents the data encountered during deployment. They generally ignore the possibility of even slightly corrupted inputs, which makes models less robust and reduces accuracy, because the models also try to fit the noise when making predictions. Even a small amount of corruption can substantially degrade the performance of various models: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when corruptions are introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also cites evidence that networks trained on translation augmentations remain highly sensitive to pixel shifts.<br />
The paper proposes a new technique, AugMix, which achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The paper uses the CIFAR-10, CIFAR-100, and ImageNet datasets to confirm the results. AugMix utilizes stochasticity and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies several basic augmentation operations. These operations are layered in chains to create a highly diverse set of augmented images. The loss is computed using the Jensen-Shannon divergence.<br />
The method proposed by the authors can be divided into three major components:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is constructed as shown in the figure above.<br />
2. Mixing: The images produced by these augmentation chains are combined by mixing. The authors use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once the images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose weight is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence<br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The authors use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by applying corruptions to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets; they contain smaller perturbations than the -C datasets and are used to measure the stability of the classifier’s predictions.<br />
The metric used to compare the models is the error rate. The clean error is the error rate on the uncorrupted dataset. The experiments use 15 corruption types, so the corruption error of a model is reported as the average of its error rates over these corruptions.<br />
To assess a model’s uncertainty estimates, its miscalibration is measured using the Brier Score or the RMS Calibration Error.<br />
<br />
<br />
===RESULTS ON CIFAR DATASET===<br />
<br />
For the CIFAR datasets, 15 corruptions are applied.<br />
<br />
Setup: The authors use the following models for comparison:<br />
1. An All Convolutional Network
2. A DenseNet-BC (k = 12, d = 100)
3. A 40-2 Wide ResNet
4. A ResNeXt-29
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs for convergence. The weight decay is 0.0001 for Mixup and 0.0005 otherwise.<br />
<br />
<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
===RESULTS ON ImageNet DATASET===<br />
<br />
== Conclusion ==</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43299Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T09:48:45Z<p>A2chanan: /* RESUTLS ON CIFAR DATASET */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data accurately represents the data encountered during deployment. They generally ignore the possibility of even slightly corrupted inputs, which makes models less robust and reduces accuracy, because the models also try to fit the noise when making predictions. Even a small amount of corruption can substantially degrade the performance of various models: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when corruptions are introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also cites evidence that networks trained on translation augmentations remain highly sensitive to pixel shifts.<br />
The paper proposes a new technique, AugMix, which achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The paper uses the CIFAR-10, CIFAR-100, and ImageNet datasets to confirm the results. AugMix utilizes stochasticity and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies several basic augmentation operations. These operations are layered in chains to create a highly diverse set of augmented images. The loss is computed using the Jensen-Shannon divergence.<br />
The method proposed by the authors can be divided into three major components:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is constructed as shown in the figure above.<br />
2. Mixing: The images produced by these augmentation chains are combined by mixing. The authors use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once the images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose weight is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence<br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The authors use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by applying corruptions to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets; they contain smaller perturbations than the -C datasets and are used to measure the stability of the classifier’s predictions.<br />
The metric used to compare the models is the error rate. The clean error is the error rate on the uncorrupted dataset. The experiments use 15 corruption types, so the corruption error of a model is reported as the average of its error rates over these corruptions.<br />
To assess a model’s uncertainty estimates, its miscalibration is measured using the Brier Score or the RMS Calibration Error.<br />
<br />
<br />
===RESULTS ON CIFAR DATASET===<br />
<br />
For the CIFAR datasets, 15 corruptions are applied.<br />
<br />
Setup: The authors use the following models for comparison:<br />
1. An All Convolutional Network
2. A DenseNet-BC (k = 12, d = 100)
3. A 40-2 Wide ResNet
4. A ResNeXt-29
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs for convergence. The weight decay is 0.0001 for Mixup and 0.0005 otherwise.<br />
<br />
<br />
[[File:CIFAR 1.png|100px|Image: 1000 pixels]]<br />
<br />
===RESULTS ON ImageNet DATASET===<br />
<br />
== Conclusion ==</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43298Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T09:48:37Z<p>A2chanan: /* RESUTLS ON CIFAR DATASET */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data accurately represents the data encountered during deployment. They generally ignore the possibility of even slightly corrupted inputs, which makes models less robust and reduces accuracy, because the models also try to fit the noise when making predictions. Even a small amount of corruption can substantially degrade the performance of various models: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when corruptions are introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also cites evidence that networks trained on translation augmentations remain highly sensitive to pixel shifts.<br />
The paper proposes a new technique, AugMix, which achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The paper uses the CIFAR-10, CIFAR-100, and ImageNet datasets to confirm the results. AugMix utilizes stochasticity and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies several basic augmentation operations. These operations are layered in chains to create a highly diverse set of augmented images. The loss is computed using the Jensen-Shannon divergence.<br />
The method proposed by the authors can be divided into three major components:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is constructed as shown in the figure above.<br />
2. Mixing: The images produced by these augmentation chains are combined by mixing. The authors use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once the images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose weight is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence<br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The authors use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by applying corruptions to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets; they contain smaller perturbations than the -C datasets and are used to measure the stability of the classifier’s predictions.<br />
The metric used to compare the models is the error rate. The clean error is the error rate on the uncorrupted dataset. The experiments use 15 corruption types, so the corruption error of a model is reported as the average of its error rates over these corruptions.<br />
To assess a model’s uncertainty estimates, its miscalibration is measured using the Brier Score or the RMS Calibration Error.<br />
<br />
<br />
===RESULTS ON CIFAR DATASET===<br />
<br />
For the CIFAR datasets, 15 corruptions are applied.<br />
<br />
Setup: The authors use the following models for comparison:<br />
1. An All Convolutional Network
2. A DenseNet-BC (k = 12, d = 100)
3. A 40-2 Wide ResNet
4. A ResNeXt-29
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs for convergence. The weight decay is 0.0001 for Mixup and 0.0005 otherwise.<br />
<br />
<br />
[[File:CIFAR 1.png|100px|Image: 100 pixels]]<br />
<br />
===RESULTS ON ImageNet DATASET===<br />
<br />
== Conclusion ==</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43297Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T09:47:05Z<p>A2chanan: /* RESUTLS ON CIFAR DATASET== */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data accurately represents the data encountered during deployment. They generally ignore the possibility of even slightly corrupted inputs, which makes models less robust and reduces accuracy, because the models also try to fit the noise when making predictions. Even a small amount of corruption can substantially degrade the performance of various models: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when corruptions are introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also cites evidence that networks trained on translation augmentations remain highly sensitive to pixel shifts.<br />
The paper proposes a new technique, AugMix, which achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The paper uses the CIFAR-10, CIFAR-100, and ImageNet datasets to confirm the results. AugMix utilizes stochasticity and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies several basic augmentation operations. These operations are layered in chains to create a highly diverse set of augmented images. The loss is computed using the Jensen-Shannon divergence.<br />
The method proposed by the authors can be divided into three major components:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is constructed as shown in the figure above.<br />
2. Mixing: The images produced by these augmentation chains are combined by mixing. The authors use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once the images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose weight is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence<br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The authors use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by applying corruptions to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets; they contain smaller perturbations than the -C datasets and are used to measure the stability of the classifier’s predictions.<br />
The metric used to compare the models is the error rate. The clean error is the error rate on the uncorrupted dataset. The experiments use 15 corruption types, so the corruption error of a model is reported as the average of its error rates over these corruptions.<br />
To assess a model’s uncertainty estimates, its miscalibration is measured using the Brier Score or the RMS Calibration Error.<br />
<br />
<br />
===RESULTS ON CIFAR DATASET===<br />
<br />
For the CIFAR datasets, 15 corruptions are applied.<br />
<br />
Setup: The authors use the following models for comparison:<br />
1. An All Convolutional Network
2. A DenseNet-BC (k = 12, d = 100)
3. A 40-2 Wide ResNet
4. A ResNeXt-29
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs for convergence. The weight decay is 0.0001 for Mixup and 0.0005 otherwise.<br />
<br />
[[File:CIFAR 1.png]]<br />
<br />
===RESULTS ON ImageNet DATASET===<br />
<br />
== Conclusion ==</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:CIFAR_1.png&diff=43296File:CIFAR 1.png2020-11-04T09:46:34Z<p>A2chanan: </p>
<hr />
<div></div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43295Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T09:45:11Z<p>A2chanan: /* Experiments and Results */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data accurately represents the data encountered during deployment. They generally ignore the possibility of even slightly corrupted inputs, which makes models less robust and reduces accuracy, because the models also try to fit the noise when making predictions. Even a small amount of corruption can substantially degrade the performance of various models: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when corruptions are introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also cites evidence that networks trained on translation augmentations remain highly sensitive to pixel shifts.<br />
The paper proposes a new technique, AugMix, which achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The paper uses the CIFAR-10, CIFAR-100, and ImageNet datasets to confirm the results. AugMix utilizes stochasticity and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies several basic augmentation operations. These operations are layered in chains to create a highly diverse set of augmented images. The loss is computed using the Jensen-Shannon divergence.<br />
The method proposed by the authors can be divided into three major components:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is constructed as shown in the figure above.<br />
2. Mixing: The images produced by these augmentation chains are combined by mixing. The authors use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once the images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose weight is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence<br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The authors use the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by applying corruptions to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets; they contain smaller perturbations than the -C datasets and are used to measure the stability of the classifier’s predictions.<br />
The metric used to compare the models is the error rate. The clean error is the error rate on the uncorrupted dataset. The experiments use 15 corruption types, so the corruption error of a model is reported as the average of its error rates over these corruptions.<br />
To assess a model’s uncertainty estimates, its miscalibration is measured using the Brier Score or the RMS Calibration Error.<br />
<br />
<br />
===RESULTS ON CIFAR DATASET===<br />
<br />
For the CIFAR datasets, 15 corruptions are applied.<br />
<br />
Setup: The authors use the following models for comparison:<br />
1. An All Convolutional Network
2. A DenseNet-BC (k = 12, d = 100)
3. A 40-2 Wide ResNet
4. A ResNeXt-29
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs for convergence. The weight decay is 0.0001 for Mixup and 0.0005 otherwise.<br />
<br />
<br />
===RESULTS ON ImageNet DATASET===<br />
<br />
== Conclusion ==</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43294Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T08:57:14Z<p>A2chanan: /* Experiments */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data accurately represents the data encountered during deployment. They generally ignore the possibility of even slightly corrupted inputs, which makes models less robust and reduces accuracy, because the models also try to fit the noise when making predictions. Even a small amount of corruption can substantially degrade the performance of various models: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when corruptions are introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also cites evidence that networks trained on translation augmentations remain highly sensitive to pixel shifts.<br />
The paper proposes a new technique, AugMix, which achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The paper uses the CIFAR-10, CIFAR-100, and ImageNet datasets to confirm the results. AugMix utilizes stochasticity and diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies several basic augmentation operations. These operations are layered in chains to create a highly diverse set of augmented images. The loss is computed using the Jensen-Shannon divergence.<br />
The method proposed by the authors can be divided into three major components:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is constructed as shown in the figure above.<br />
2. Mixing: The images produced by these augmentation chains are combined by mixing. The authors use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once the images are mixed, a “skip connection” combines the result of the augmentation chains with the original image through a second random convex combination, whose weight is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence<br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments and Results==<br />
<br />
== Conclusion ==</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43293Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T08:56:28Z<p>A2chanan: /* Approach */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data correctly represents the data encountered during deployment. They generally ignore the possibility of receiving slightly corrupted inputs, which makes models less robust and reduces their accuracy, as the models also try to fit the noise when making predictions. Even a small amount of corruption can degrade the performance of various models: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when some corruption is introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also provides evidence that networks trained on translation augmentations remain highly sensitive to pixel shifts.<br />
The paper introduces AugMix, a data processing method that achieves state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The results are confirmed on the CIFAR-10, CIFAR-100, and ImageNet datasets. AugMix combines stochastic, diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve this performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies a set of basic augmentation operations. These operations are layered in chains to produce a highly diverse set of augmented images, and the training loss includes a Jensen-Shannon divergence consistency term.<br />
The method proposed by the authors can be divided into three major parts:<br />
1. Augmentations: The authors build augmentation chains by composing basic data augmentation operations taken from AutoAugment. A chain is created as shown in the figure above.<br />
2. Mixing: The images produced by the augmentation chains are combined by mixing. The authors chose elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. Once the images are mixed, a “skip connection” combines the mixed result with the original image through a second random convex combination, whose weight is sampled from a Beta(α, α) distribution.<br />
<br />
3. Jensen-Shannon divergence<br />
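<br />
As a rough illustration of the Jensen-Shannon divergence named in point 3, the sketch below computes the generalised form for several probability vectors, for example a network's predictions on an image and on its augmented versions. How many distributions are compared and how the term is weighted in the full loss are not stated in this summary, so those choices, along with the function name <code>js_divergence</code> and the clipping constant <code>eps</code>, are assumptions of the example.<br />

```python
import numpy as np

def js_divergence(probs, eps=1e-12):
    """Generalised Jensen-Shannon divergence between several distributions.

    probs : array-like of shape (n, c); each row is a probability vector
    eps   : small constant that avoids log(0) (an assumption of this sketch)
    """
    p = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0)
    mixture = p.mean(axis=0)  # M: the average of the n distributions
    # KL(p_i || M) for every row, then the average over the rows
    kl = np.sum(p * (np.log(p) - np.log(mixture)), axis=1)
    return float(kl.mean())
```

The value is zero when all the distributions agree and grows as they diverge, which is why it can act as a consistency penalty.<br />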
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments ==<br />
<br />
<br />
<br />
== Conclusion ==</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=43292Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-04T08:50:27Z<p>A2chanan: Created page with "== Presented by == Abhinav Chanana == Introduction == Often a times machine learning algorithms assume that the training data is the correct representation of the data enco..."</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data correctly represents the data encountered during deployment. They generally ignore the possibility of receiving slightly corrupted inputs, which makes models less robust and reduces their accuracy, as the models also try to fit the noise when making predictions. Even a small amount of corruption can degrade the performance of various models: Hendrycks & Dietterich (2019) show that the classification error rises from 25% to 62% when some corruption is introduced on the ImageNet test set. <br />
The problem with training on specific corruptions is that it encourages the network to memorize those corruptions rather than generalize to unseen ones. The paper also provides evidence that networks trained on translation augmentations remain highly sensitive to pixel shifts.<br />
The paper introduces AugMix, a data processing method that achieves state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The results are confirmed on the CIFAR-10, CIFAR-100, and ImageNet datasets. AugMix combines stochastic, diverse augmentations, a Jensen-Shannon divergence consistency loss, and a formulation for mixing multiple augmented images to achieve this performance.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies a set of basic augmentation operations. These operations are layered in chains to produce a highly diverse set of augmented images, and the training loss includes a Jensen-Shannon divergence consistency term.<br />
<br />
<br />
<br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
2. CIFAR 100 - https://www.cs.toronto.edu/~kriz/cifar.html<br />
3. ImageNet - http://image-net.org/download<br />
<br />
== Experiments ==<br />
<br />
<br />
<br />
== Conclusion ==</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=42793stat940F212020-10-21T22:33:06Z<p>A2chanan: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || || 1|| || || ||<br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Processing Method to Improve Robustness and Uncertainty || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || ||<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || https://openreview.net/pdf?id=H1eA7AEtvS || ||<br />
|-<br />
|Week of Nov 2 ||John Edwards || 4||Learn, Imagine and Create: Text-to-Image Generation from Prior Knowledge ||[https://papers.nips.cc/paper/8375-learn-imagine-and-create-text-to-image-generation-from-prior-knowledge.pdf Paper] || ||<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| STRUCTBERT: INCORPORATING LANGUAGE STRUCTURES INTO PRETRAINING FOR DEEP LANGUAGE UNDERSTANDING || [https://openreview.net/pdf?id=BJgQ4lSFPH] || ||<br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || ||<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || ||<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Probabilistic Model-Agnostic Meta-Learning || [http://papers.nips.cc/paper/8161-probabilistic-model-agnostic-meta-learning.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIR COMPARISON OF GRAPH NEURAL NETWORKS FOR GRAPH CLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| EMPIRICAL STUDIES ON THE PROPERTIES OF LINEAR REGIONS IN DEEP NEURAL NETWORKS || [https://openreview.net/pdf?id=SkeFl1HKwr Paper] || ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Generalization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| Boosting Few-Shot Visual Learning with Self-Supervision || https://openaccess.thecvf.com/content_ICCV_2019/papers/Gidaris_Boosting_Few-Shot_Visual_Learning_With_Self-Supervision_ICCV_2019_paper.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||<br />
|-</div>A2chananhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=42792stat940F212020-10-21T22:32:30Z<p>A2chanan: </p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || || 1|| || || ||<br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Processing Method to Improve Robustness and Uncertainty || [https://openreview.net/pdf?id=S1gmrxHFvB] || ||<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || https://openreview.net/pdf?id=H1eA7AEtvS || ||<br />
|-<br />
|Week of Nov 2 ||John Edwards || 4||Learn, Imagine and Create: Text-to-Image Generation from Prior Knowledge ||[https://papers.nips.cc/paper/8375-learn-imagine-and-create-text-to-image-generation-from-prior-knowledge.pdf Paper] || ||<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| STRUCTBERT: INCORPORATING LANGUAGE STRUCTURES INTO PRETRAINING FOR DEEP LANGUAGE UNDERSTANDING || [https://openreview.net/pdf?id=BJgQ4lSFPH] || ||<br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || ||<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || ||<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Probabilistic Model-Agnostic Meta-Learning || [http://papers.nips.cc/paper/8161-probabilistic-model-agnostic-meta-learning.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIR COMPARISON OF GRAPH NEURAL NETWORKS FOR GRAPH CLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| EMPIRICAL STUDIES ON THE PROPERTIES OF LINEAR REGIONS IN DEEP NEURAL NETWORKS || [https://openreview.net/pdf?id=SkeFl1HKwr Paper] || ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Generalization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| Boosting Few-Shot Visual Learning with Self-Supervision || https://openaccess.thecvf.com/content_ICCV_2019/papers/Gidaris_Boosting_Few-Shot_Visual_Learning_With_Self-Supervision_ICCV_2019_paper.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||<br />
|-</div>A2chanan