http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=J32edwar&feedformat=atomstatwiki - User contributions [US]2022-05-28T09:52:30ZUser contributionsMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT&diff=49159BERTScore: Evaluating Text Generation with BERT2020-12-04T22:39:06Z<p>J32edwar: /* Previous Work */</p>
<hr />
<div>== Presented by == <br />
Gursimran Singh<br />
<br />
== Introduction == <br />
In recent times, various machine learning approaches for text generation have gained popularity. This paper aims to develop an automatic metric that will judge the quality of the generated text. Commonly used state of the art metrics either use n-gram approaches or word embeddings for calculating the similarity between the reference and the candidate sentence. BertScore, on the other hand, calculates the similarity using contextual embeddings. BertScore basically addresses two common pitfalls in n-gram-based metrics. Firstly, the n-gram models fail to robustly match paraphrases which leads to performance underestimation when semantically-correct phrases are penalized because of their difference from the surface form of the reference. On the other hand in BertScore, the similarity is computed using contextualized token embeddings, which have been shown to be effective for paraphrase detection. Secondly, n-gram models fail to capture distant dependencies and penalize semantically-critical ordering changes. In contrast, contextualized embeddings capture distant dependencies and ordering effectively. The authors of the paper have carried out various experiments in Machine Translation and Image Captioning to show why BertScore is more reliable and robust than the previous approaches.<br />
<br />
''' Word versus Context Embeddings '''<br />
<br />
Both models aim to reduce the sparseness invoked by a bag of words (BoW) representation of text due to the high dimensional vocabularies. Both methods create embeddings of a dimensionality much lower than sparse BoW and aim to capture semantics and context. Word embeddings differ in that they will be deterministic as when given a word embedding model will always produce the same embedding, regardless of the surrounding words. However, contextual embeddings will create different embeddings for a word depending on the surrounding words in the given text.<br />
<br />
== Previous Work ==<br />
Previous Approaches for evaluating text generation can be broadly divided into various categories. The commonly used techniques for text evaluation are based on n-gram matching. The main objective here is to compare the n-grams in reference and candidate sentences and thus analyze the ordering of words in the sentences. <br />
The most popular n-Gram Matching metric is BLEU. It follows the underlying principle of n-Gram matching and its uniqueness comes from three main factors. <br><br />
• Each n-Gram is matched at most once. <br><br />
• The total of exact-matches is accumulated for all reference candidate pairs and divided by the total number of <math>n</math>-grams in all candidate sentences. <br><br />
• Very short candidates are restricted. <br><br />
<br />
Further BLEU is generally calculated for multiple <math>n</math>-grams and averaged geometrically.<br />
n-Gram approaches also include METEOR, NIST, ΔBLEU, etc.<br />
METEOR (Banerjee & Lavie, 2005) computes Exact- <math> P_1 </math> and Exact- <math> R_1 </math> with the modification that when the exact unigram matching is not possible, matching to word stems, synonyms, and paraphrases are used instead. For example, ''running'' may be matched with ''run'' if no exact match was found. This non-exact matching is done using external tools such as a paraphrase table. In newer versions of METEOR, an external paraphrase resource is used and different weights are assigned to different matching types. <br />
<br />
Most of these methods utilize or slightly modify the exact match precision (Exact-<math>P_n</math>) and recall (Exact-<math>R_n</math>) scores. These scores can be formalized as follows:<br />
<br />
<div align="center">Exact- <math> P_n = \frac{\sum_{w \ in S^{n}_{ \hat{x} }} \mathbb{I}[w \in S^{n}_{x}]}{S^{n}_{\hat{x}}} </math> </div><br />
<br />
<div align="center">Exact- <math> R_n = \frac{\sum_{w \ in S^{n}_{x}} \mathbb{I}[w \in S^{n}_{\hat{x}}]}{S^{n}_{x}} </math> </div><br />
<br />
Here <math>S^{n}_{x}</math> and <math>S^{n}_{\hat{x}}</math> are lists of token <math>n</math>-grams in the reference <math>x</math> and candidate <math>\hat{x}</math> sentences respectively.<br />
<br />
Other categories include Edit-distance-based Metrics which compare two strings by calculating the minimum operations to transform one into the other, Embedding-based metrics which are derive based on an applied embedding space to the strings, and Learned Metrics which construct task specific-metrics using a machine learning approach on a supervised data set. Most of these techniques do not capture the context of a word in the sentence. Moreover, Learned Metric approaches also require costly human judgements as supervision for each datasets.<br />
<br />
== Motivation ==<br />
The <math>n</math>-gram approaches like BLEU do not capture the positioning and the context of the word and simply rely on exact matching for evaluation. Consider the following example that shows how BLEU can result in incorrect judgment. <br><br />
Reference: people like foreign cars <br><br />
Candidate 1: people like visiting places abroad <br><br />
Candidate 2: consumers prefer imported cars<br />
<br />
BLEU gives a higher score to Candidate 1 as compared to Candidate 2. This undermines the performance of text generation models since contextually correct sentences are penalized. In contrast, some semantically different phrases are scored higher just because they are closer to the surface form of the reference sentence. <br />
<br />
On the other hand, BERTScore computes similarity using contextual token embeddings. It helps in detecting semantically correct paraphrased sentences. It also captures the cause and effect relationship (A gives B in place of B gives A) that the BLEU score isn't detected.<br />
<br />
== BERTScore Architecture ==<br />
Fig 1 summarizes the steps for calculating the BERTScore. Next, we will see details about each step. Here, the reference sentence is given by <math> x = ⟨x1, . . . , xk⟩ </math> and candidate sentence <math> \hat{x} = ⟨\hat{x1}, . . . , \hat{xl}⟩. </math> <br><br />
<br />
<div align="center"> [[File:Architecture_BERTScore.PNG|Illustration of the computation of BERTScore.]] </div><br />
<div align="center">'''Fig 1'''</div><br />
<br />
=== Token Representation ===<br />
Reference and candidate sentences are represented using contextual embeddings. Word embedding techniques inspire this but in contrast to word embeddings, the contextual embedding of a word depends upon the surrounding words in the sentence. These contextual embeddings are calculated using BERT and other similar models which utilize self-attention and nonlinear transformations.<br />
<br />
<div align="center"> [[File:Pearsson_corr_contextual_emb.PNG|Pearson Correlation for Contextual Embedding]] </div><br />
<div align="center">'''Fig 2'''</div><br />
<br />
=== Cosine Similarity ===<br />
Pairwise cosine similarity is calculated between each token <math> x_{i} </math> in reference sentence and <math> \hat{x}_{j} </math> in candidate sentence. Prenormalized vectors are used, therefore the pairwise similarity is given by <math> x_{i}^T \hat{x_{i}}. </math><br />
<br />
=== BERTScore ===<br />
<br />
Each token in x is matched to the most similar token in <math> \hat{x} </math> and vice-versa for calculating Recall and Precision respectively. The matching is greedy and isolated. Precision and Recall are combined for calculating the F1 score. The equations for calculating Precision, Recall, and F1 Score are as follows<br />
<br />
<div align="center"> [[File:Equations.PNG|Equations for the calculation of BERTScore.]] </div><br />
<br />
<br />
=== Importance Weighting (optional) ===<br />
In some cases, rare words can be highly indicative of sentence similarity. Therefore, Inverse Document Frequency (idf) can be used with the above equations of the BERTScore. This is optional and depending on the domain of the text and the available data it may or may not benefit the final results. Thus understanding more about Importance Weighing is an open area of research.<br />
<br />
=== Baseline Rescaling ===<br />
Rescaling is done only to increase the human readability of the score. In theory, cosine similarity values are between -1 and 1 but practically they are confined in a much smaller range. A value b computed using Common Crawl monolingual datasets is used to linearly rescale the BERTScore. The rescaled recall <math> \hat{R}_{BERT} </math> is given by<br />
<div align="center"> [[File:Equation2.PNG|Equation for the rescaled BERTScore.]] </div><br />
Similarly, <math> P_{BERT} </math> and <math> F_{BERT} </math> are rescaled as well.<br />
<br />
== Experiment & Results ==<br />
The authors have experimented with different pre-trained contextual embedding models like BERT, RoBERTa, etc, and reported the best performing model results. The evaluation has been done on Machine Translation and Image Captioning tasks. <br />
<br />
=== Machine Translation ===<br />
The metric evaluation dataset consists of 149 translation systems, gold references, and two types of human judgments, namely, Segment-level human judgments and System-level human judgments. The former assigns a score to each reference candidate pair and the latter associates a single score for the whole system. Segment-level outputs for BERTScore are calculated as explained in the previous section on architecture and the System-level outputs are calculated by taking an average of BERTScore for every reference-candidate pair. Absolute Pearson Correlation <math> \lvert \rho \rvert </math> and Kendall rank correlation <math> \tau </math> are used for calculating metric quality, Williams test <sup> [1] </sup> for significance of <math> \lvert \rho \rvert </math> and Graham & Baldwin <sup> [2] </sup> methods for calculating the bootstrap resampling of <math> \tau </math>. The authors have also created hybrid systems by randomly sampling one candidate sentence for each reference sentence from one of the systems. This increases the volume of systems for System-level experiments. Further, the authors have also randomly selected 100 systems out of 10k hybrid systems for ranking them using automatic metrics. They have repeated this process multiple times and generated Hits@1, which contains the percentage of the metric ranking agreeing with human ranking on the best system. <br />
<br />
<div align="center"> '''The following 4 tables show the result of the experiments mentioned above.''' </div> <br><br />
<br />
<div align="center"> [[File:Table1_BERTScore.PNG|700px| Table1 Machine Translation]] [[File:Table2_BERTScore.PNG|700px| Table2 Machine Translation]] </div><br />
<div align="center"> [[File:Table3_BERTScore.PNG|700px| Table3 Machine Translation]] [[File:Table4_BERTScore.PNG|700px| Table4 Machine Translation]] </div><br />
<br />
In all 4 tables, we can see that BERTScore is consistently a top performer. It also gives a large improvement over the current state-of-the-art BLEU score. In to-English translation, RUSE shows competitive results but it is a learned metric technique and requires costly human judgments as supervision.<br />
<br />
=== Image Captioning ===<br />
For Image Captioning, human judgment for 12 submission entries from the COCO 2015 Captioning Challenge is used. As per Cui et al. (2018) <sup> [3] </sup>, Pearson Correlation with two System-Level metrics is calculated. The metrics are the percentage of captions better or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). There are approximately 5 reference captions and the BERTScore is taken to be the maximum of all the BERTScores individually with each reference caption. BERTScore is compared with 8 task-agnostic metrics and 2 task-specific metrics. <br />
<br />
<div align="center"> [[File:Table5_BERTScore.PNG|450px| Table5 Image Captioning]] </div><br />
<br />
<div align="center"> '''Table 5: Pearson correlation on the 2015 COCO Captioning Challenge.''' </div><br />
<br />
BERTScore is again a top performer and n-gram metrics like BLEU show a weak correlation with human judgments. For this task, importance weighting shows significant improvement depicting the importance of content words. <br />
<br />
'''Speed:''' The time taken for calculating BERTScore is not significantly higher than BLEU. For example, with the same hardware, the Machine Translation test on BERTScore takes 15.6 secs compared to 5.4 secs for BLEU. The time range is essentially small and thus the difference is marginal.<br />
<br />
== Robustness Analysis ==<br />
The authors tested BERTScore's robustness using two adversarial paraphrase classification datasets, QQP, and PAWS. The table below summarized the result. Most metrics have a good performance on QQP, but their performance drops significantly on PAWS. Conversely, BERTScore performs competitively on PAWS, which suggests BERTScore is better at distinguishing harder adversarial examples.<br />
<br />
<div align="center"> [[File: bertscore.png | 500px]] </div><br />
<br />
== Source Code == <br />
The code for this paper is available at [https://github.com/Tiiiger/bert_score BERTScore].<br />
<br />
== Critique & Future Prospects==<br />
A text evaluation metric BERTScore is proposed which outperforms the previous approaches because of its capacity to use contextual embeddings for evaluation. It is simple and easy to use. BERTScore is also more robust than previous approaches. This is shown by the experiments carried on the datasets consisting of paraphrased sentences. There are variants of BERTScore depending upon the contextual embedding model, use of importance weighting, and the evaluation metric (Precision, Recall, or F1 score). <br />
<br />
The main reason behind the success of BERTScore is the use of contextual embeddings. The remaining architecture is straightforward in itself. There are some word embedding models that use complex metrics for calculating similarity. If we try to use those models along with contextual embeddings instead of word embeddings, they might result in more reliable performance than the BERTScore.<br />
<br />
<br />
The paper was quite interesting, but it is obvious that they lack technical novelty in their proposed approach. Their method is a natural application of BERT along with traditional cosine similarity measures and precision, recall, F1-based computations, and simple IDF-based importance weighting.<br />
<br />
== References ==<br />
<br />
[1] Evan James Williams. Regression analysis. wiley, 1959.<br />
<br />
[2] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In EMNLP, 2014.<br />
<br />
[3] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. Learning to evaluate image captioning. In CVPR, 2018.<br />
<br />
[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.<br />
<br />
[5] Qingsong Ma, Ondrej Bojar, and Yvette Graham. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In WMT, 2018.<br />
<br />
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.<br />
<br />
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019b.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=F21-STAT_940-Proposal&diff=47302F21-STAT 940-Proposal2020-11-28T16:46:09Z<p>J32edwar: </p>
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
<br />
<br />
<br />
Project # 1 Group members:<br />
<br />
McWhannel, Pierre<br />
<br />
Yan, Nicole<br />
<br />
Hussein Salamah, Ahmed <br />
<br />
Title: Dense Retrieval for Conversational Information Seeking <br />
<br />
Description:<br />
One of the recognized problems in Information Retrieval (IR) is the conversational search that attracts much attention in form of Conversational Assistants such as Alexa, Siri and Cortana. The users’ needs are the ultimate goal of conversational search systems, in this context the questions are asked sequentially imposing a multi-turn format as the Conversational Information Seeking (CIS) task. TREC Conversational Assistance Track (CAsT) [3] is a multi-turn conversational search task as it contains a large-scale reusable test collection for sequences of conversational queries. The response of this conversational model is not a list of relevant documents, but it is limited to brief response passages with a length of 1 to 3 sentences in length.<br />
<br />
[[File:Screen Shot 2020-10-09 at 1.33.00 PM.png | 300px | Example Queries in CAsT]]<br />
<br />
In [4], the authors focus on improving open domain question answering by including dense representations for retrieval instead of the traditional methods. They have adopted a simple dual-encoder framework to construct a learnable retriever on large collections. We want to adopt this dense representation for the conversational model in the CAsT task and compare it with the performance of the other approaches in literature. The performance will be indicated by using graded relevance on five point, which are Fails to meet, Slightly meets, Moderately meets, Highly meets, and Fully meets.<br />
<br />
We aim to further improve our system performance by integrating the following techniques:<br />
<br />
• Paragraph-level pre-training tasks: ICT, BFS, and WLP [1]<br />
<br />
• ANCE training: periodically using checkpoints to encode documents, from which the strong negatives close to the relevant document would be used as next training negatives [5]<br />
<br />
In summary, this project is exploratory in nature as we will be trying to use state-of-art Dense Passage Retrieval techniques (based on BERT) [4, 6], in a question answering (QA) problem. Current first-stage-retrieval approaches mainly rely on bag-of-words models. In this project, we hope to explore the feasibility of using state-of-art methods such as BERT. We will first compare how these perform on the TREC CAsT datasets [3] against the results retrieved using BM25. After these first points of comparison we will next explore methods of improving DPR by exploring one or more techniques that are made to improve the performance of DPR. [1, 5].<br />
<br />
References<br />
<br />
[1] Wei-Cheng Chang et al. Pre-training Tasks for Embedding-based Large-scale Retrieval. 2020. arXiv: 2002.03932 [cs.LG].<br />
<br />
[2] Zhuyun Dai and Jamie Callan. Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval. 2019. arXiv: 1910.10687 [cs.IR].<br />
<br />
[3] Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. TREC CAsT 2019: The Conversational Assistance Track Overview. 2020. arXiv: 2003.13624 [cs.IR].<br />
<br />
[4] Vladimir Karpukhin et al. Dense Passage Retrieval for Open-Domain Ques- tion Answering. 2020. arXiv: 2004.04906 [cs.CL].<br />
<br />
[5] Lee Xiong et al. Approximate Nearest Neighbor Negative Contrastive Learn- ing for Dense Text Retrieval. 2020. arXiv: 2007.00808 [cs.IR].<br />
<br />
[6] Jingtao Zhan et al. RepBERT: Contextualized Text Embeddings for First- Stage Retrieval. 2020. arXiv: 2006.15498 [cs.IR].<br />
<br />
<br />
<br />
Project # 2 Group members:<br />
<br />
Singh, Gursimran<br />
<br />
Sharma, Govind<br />
<br />
Chanana, Abhinav<br />
<br />
Title: Quick Text Description using Headline Generation and Text To Image Conversion<br />
<br />
Description: An automatic tool to generate short description based on long textual data is a useful mechanism to share quick information. Most of the current approaches involve summarizing the text using varied deep learning approaches from Transformers to different RNNs. For this project, instead of building a standard text summarizer, we aim to provide two separate utilities for generating a quick description of the text. First, we plan to develop a model that produces a headline for the long textual data, and second, we are intending to generate an image describing the text. <br />
<br />
Headline Generation - Headline generation is a specific case of text summarization where the output is generally a combination of few words that gives an overall outcome from the text. In most cases, text summarization is an unsupervised learning problem. But, for the headline generation, we have the original headlines available in our training dataset that makes it a supervised learning task. We plan to experiment with different Recurrent Neural Networks like LSTMs and GRUs with varied architectures. For model evaluation, we are considering BERTScore using which we can compare the reference headline with the automatically generated headline from the model. We also aim to explore Attention and Transformer Networks for the text (headline) generation. We will make use of the currently available techniques mentioned in the various research papers but also try to develop our own architecture if the previous methods don't reveal reliable results on our dataset. Therefore, this task would primarily fit under the category of application of deep learning to a particular domain, but could also include some components of new algorithm design.<br />
<br />
Text to Image Conversion - Generation or synthesis of images from a short text description is another very interesting application domain in deep learning. One approach for image generation is based on mapping image pixels to specific features as described by the discriminative feature representation of the text. Recurrent Neural Networks have been successfully used in learning such feature representations of text. This approach is difficult to generalize because the recognition of discriminative features for texts in different domains is not an easy task and it requires domain expertise. Different generative methods have been used including Variational Recurrent Auto-Encoders and its extension in Deep Recurrent Attention Writer (DRAW). We plan to experiment with Generative Adversarial Networks (GAN). Application of GANs on domain-specific datasets has been done but we aim to apply different variants of GANs on the Microsoft COCO dataset which has been used in other architectures. The analysis will be focusing on how well GANs are able to generalize when compared to other alternatives on the given dataset.<br />
<br />
Scope - The above models will be trained independently on different datasets. Therefore, for a particular text, only one of the two functionalities will be available.<br />
<br />
<br />
<br />
Project # 3 Group members:<br />
<br />
Sikri, Gaurav<br />
<br />
Bhatia, Jaskirat<br />
<br />
Title: Malware Prediction<br />
<br />
Description: The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can hurt consumers and enterprises in many ways. With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security.<br />
<br />
In this project, we plan to predict how likely a machine is to be infected by malware given its current specifications(total 82) like: company name, Firewall status, physical RAM, etc.<br />
<br />
<br />
<br />
Project # 4 Group members:<br />
<br />
Maleki, Danial<br />
<br />
Rasoolijaberi, Maral<br />
<br />
Title: Binary Deep Neural Network for the domain of Pathology<br />
<br />
Description: The binary neural network, largely saving the storage and computation, serves as a promising technique for deploying deep models on resource-limited devices. However, the binarization inevitably causes severe information loss, and even worse, its discontinuity brings difficulty to the optimization of the deep network. We want to investigate the possibility of using these types of networks in the domain of histopathology as it has gigapixels images which make the use of them very useful.<br />
<br />
<br />
Project # 5 Group members:<br />
<br />
Jain, Abhinav<br />
<br />
Bathla, Gautam<br />
<br />
Title: lyft-motion-prediction-autonomous-vehicles(Kaggle)(Tentative)<br />
<br />
Description: Autonomous vehicles (AVs) are expected to dramatically redefine the future of transportation. However, there are still significant engineering challenges to be solved before one can fully realize the benefits of self-driving cars. One such challenge is building models that reliably predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians.<br />
<br />
Comments: We are more inclined towards a 3-D object detection project. We are in the process of finding the right problem statement for it and if we are not successful, we will continue with the above Kaggle competition.<br />
<br />
<br />
Project # 6 Group members:<br />
<br />
You, Bowen<br />
<br />
Avilez, Jose<br />
<br />
Mahmoud, Mohammad<br />
<br />
Wu, Mohan<br />
<br />
Title: Deep Learning Models in Volatility Forecasting<br />
<br />
Description: Price forecasting has become a very hot topic in the financial industry in recent years. We are however very interested in the volatility of such financial instruments. We propose a new deep learning architecture or model to predict volatility and apply our model to real life datasets of various financial products. We will analyze our results and compare them to more traditional methods.<br />
<br />
<br />
Project # 7 Group members:<br />
<br />
Chen, Meixi<br />
<br />
Shen, Wenyu<br />
<br />
Title: Through the Lens of Probability Theory: A Comparison Study of Bayesian Deep Learning Methods<br />
<br />
Description: Deep neural networks have been known as black box models, but they can be made less mysterious when adopting a Bayesian approach. From a Bayesian perspective, one is able to assign uncertainty on the weights instead of having single point estimates, which allows for a better interpretability of deep learning models. However, Bayesian deep learning methods are often intractable due an increase amount of parameters and often times don't have as good performance. In this project, we will study different BDL methods such as Bayesian CNN using variational inference and Laplace approximation, with applications on image classification, and we will try to propose improvements where possible.<br />
<br />
<br />
Project # 8 Group members:<br />
<br />
Avilez, Jose<br />
<br />
Title: A functional universal approximation theorem<br />
<br />
Description: In the seminal paper "Approximation by superpositions of a sigmoidal function", Cybenko gave a simple proof using elementary functional analysis that a certain class of functions, called discriminatory functions, serve as valid activation functions for universal neural approximators. The objective of our project is three-fold:<br />
<br />
1) Prove a converse of Cybenko's Universal Approximation Theorem by means of the Stone-Weierstrass theorem<br />
<br />
2) Provide examples and non-examples of Cybenko's discriminatory functions<br />
<br />
3) Construct a neural network for functional data (i.e. data arising in function spaces) and prove a universal approximation theorem for Lp spaces.<br />
<br />
References:<br />
<br />
[1] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4), 303-314.<br />
<br />
[2] Folland, Gerald B. Real analysis: modern techniques and their applications. Vol. 40. John Wiley & Sons, 1999.<br />
<br />
[3] Ramsay, J. O. (2004). Functional data analysis. Encyclopedia of Statistical Sciences, 4.<br />
<br />
<br />
<br />
Project # 9 Group members:<br />
<br />
Sikaroudi, Milad<br />
<br />
Ashrafi Fashi, Parsa<br />
<br />
Title: Domain Generalization with Model-Agnostic Semantic Features in Histopathology Images<br />
<br />
Description: The performance of conventional deep neural networks tends to degrade in the presence of a domain shift, such as gathering of data from different centers. In this study for the first time we are going to introduce different anatomical sites as a domain shift to see if we can generalize a low-shot anatomical site by means of rich in terms of quantity but from different anatomical site. The hypothesis is that the statistics of retrieval for model trained using episodic domain generalization will not degrade as much as the baseline when there is a domain shift. We also hypothesize that the episodic domain generalization would perform even better than the pure Meta-learning in the presence of domain shift. <br />
<br />
Instead of supervised learning we are going to work in weakly-supervised learning way in which the whole-slide diagnosis labels are only used. <br />
The questions we are going to address are: <br />
<br />
1. How is the performance of a neural network impacted by introducing domain shift (anatomical sites)?<br />
<br />
2. How domain generalization would help for improving generalization performance in the presence of domain shift, while we are in lack of data for a given anatomical site as our target domain: a pure meta-learning approach, episodic domain generalization or training a classifier on pre-trained features?<br />
<br />
<br />
Project # 10 Group members:<br />
<br />
Torabian, Parsa<br />
<br />
Ebrahimi Farsangi, Sina<br />
<br />
Moayyedi, Arash<br />
<br />
Title: Meta-Learning Regularizers for Few-Shot Classification Models<br />
<br />
Our project aims at exploring the effects of self-supervised pre-training on few-shot classification. We draw inspiration from the paper “When Does Self-supervision Improve Few-shot Learning?”[1] where the authors analyse the effects of using the Jigsaw puzzle[2] and rotation tasks as regularizers for training Prototypical Networks[3] and Model-Agnostic Meta-Learning (MAML)[4] networks. <br />
<br />
The introduced paper analyzes the effects of regularizing meta-learning models using self-supervised loss, based on rotation and Jigsaw tasks. It is conventionally thought that one of the reasons MAML and other optimization based meta-learning algorithms work well is due to initializing a network into a task-generalizable state[5]. In this project, we will be looking at the effects of self-supervised pre-training, as presumably it will initialize the network into a better state than random, and potentially improve subsequent meta-learning. We will compare the effects of using self-supervised methods as pre-training, as regularization, and the combination of both. The effects of other self-supervised learning tasks, such as discoloration and flipping, will be studied as well. We will also look at which combination of tasks, whether interlaced or applied sequentially, work better and complement one another. We will evaluate our final results on the Omniglot and Mini-Imagenet datasets. These improvements will later be compared with their application on other few-shot learning methods, including first-order MAML and Matching Networks.<br />
<br />
References:<br />
<br />
[1] https://arxiv.org/abs/1910.03560<br />
<br />
[2] https://arxiv.org/abs/1603.09246<br />
<br />
[3] https://arxiv.org/abs/1703.05175 <br />
<br />
[4] https://arxiv.org/abs/1703.03400<br />
<br />
[5] https://arxiv.org/abs/2003.11539<br />
<br />
<br />
Project # 11 Group Members:<br />
<br />
Shikhar Sakhuja: s2sakhuj@uwaterloo.ca <br />
<br />
Introduction:<br />
<br />
Controller Area Network (CAN bus) is a vehicle bus standard that allows Electronic Control Units (ECU) within an automobile to communicate with each other without the need for a host computer. Modern automobiles might have up to 70 ECUs for various subsystems such as Engine, Transmission, Breaking, etc. The ECUs exchange messages on the CAN bus and allow for a lot of modern vehicle capabilities such as automatic start/stop, electric park brakes, lane detection, collision avoidance, and more. Each message exchanged on the bus is encoded as a 29-bit packet. These 29 bits consist of a combination of Parameter Group Number (PGN), message priority, and the source address of the message. Parameter groups can be, for example, engine temperature which could include coolant temperature, fuel temperature, etc. The PGN itself includes information such as priority, reserved status, data page, and PDU format. Lastly, the source address maps the message to the ECU it originates from. <br />
<br />
Goals:<br />
<br />
(1) This project aims to use messages exchanged on the CAN bus of a Challenger Truck collected by the Embedded Systems Group at the University of Waterloo. The data exists in a temporal format with a new message exchanged periodically. The goals of this project are two folds:<br />
<br />
(2) Predicting the PGN and source address of message N exchanged on the bus, given messages 1 to N-1. We might also explore predicting attributes within the PGN. <br />
Predicting the delay between messages N-1 and N, given the delay between each pair of consecutive messages leading up to message N-1. <br />
<br />
Potential Approach:<br />
<br />
For the first goal, we intend to experiment with RNN models along with Attention modules since they have shown promising results in text generation/prediction. <br />
<br />
The second goal is more of an investigative problem where we intend to use regression techniques powered by Neural Networks to predict delays between messages N-1 and N.<br />
<br />
<br />
<br />
<br />
<br />
Project # 12 Group members:<br />
<br />
Hemati, Sobhan <br />
<br />
Meaney, Cameron <br />
<br />
Title: Representation learning of gigapixel histopathology images using PointNet a permutation invariant neural network<br />
<br />
Description:<br />
<br />
In recent years, there has been a significant growth in the amount of information available in digital pathology archives. This data is valuable because of its potential uses in research, education, and pathologic diagnosis. As a result, representation learning of histopathology whole slide images (WSIs) has attracted significant attention and become an active area of research. Unfortunately, scientific progress with these data have been difficult because of challenges inherent to the data itself. These challenges include highly complex textures of different tissue types, color variations caused by different stainings, and most notably, the size of the images which are often larger than 50,000x50,000 pixels. Additionally, these images are multi-resolution meaning that each WSI may contain images from different zoom levels, primarily 5X, 10X, 20X, and 40X. With the advent of deep learning, there is optimism that these challenges can be overcome. The main challenge in this approach is that the sheer size of the images makes it infeasible (or impossible) to obtain a vector representation for a WSI, which is a necessary step in order to leverage deep learning algorithms. In practice, this is often bypassed by considering ‘patches’ of the WSI of smaller sizes, a set of which is meant to represent the full WSI. This approach lead to a set representation for a WSI. However, unlike traditional image or sequence models, deep networks that process and learn permutation invariant representations from sets is still a developing area of research. Recent attempts at this include Multi-instance Learning Schemes, Deep Set, and Set Transformers. A particularly successful attempt in developing a deep neural network for set representation in called PointNet which was developed for classification and segmentation of 3D objects and point clouds. In PointNet, each set is represented using a set of (x,y,z) coordinates, and the network is designed to learn a permutation invariant global representation for each set and then use this representation for classification or segmentation.<br />
<br />
In this project, we attempt to first extend the PointNet network to a convolutional PointNet network such that it uses a set of image patches rather than (x,y,z) coordinates to learn the universal permutation invariant representation. Then, we attempt improve the representational power of PointNet as a permutation invariant neural network. For the first part, the main challenge is that while PointNet has been designed for processing of sets with the same size, in WSIs, the size of the image and therefore number of patches is not fixed. For this reason, we will need to develop an idea which enables CNN-PointNet to process sets with different sizes. One possible solution is to use fake members to standardize the set size and then remove the effect of these fake members in backpropagation using a masking scheme. For the second part, the PointNet network can be improved in many ways. For example, the rotation matrix used is not a real rotation matrix as the orthogonality is incorporated using a regularization term. However, using a projected gradient technique and the existence of a closed form solution for obtaining nearest orthogonal matrix to a given matrix (Orthogonal Procrustes Problem) we can keep the exact orthogonality constraint and obtain a real rotation matrix. This exact orthogonality is geometrically important as, otherwise, this transformation will likely corrupt the neighborhood structure of the points in each set. Furthermore, PointNet uses very simple symmetric function (max pooling) as a set approximator, however there more powerful symmetric functions like statistical moments, power-sum with a trainable parameter, and other set approximators can be used. It would be interesting to see how more complicated symmetric functions can improve the representational power of PointNet to achieve more discriminative permutation invariant representations for each set (in this case WSIs).<br />
<br />
Project # 13 Group Members:<br />
<br />
Syed Saad Naseem ssnaseem@uwaterloo.ca<br />
<br />
Title: Text classification of topics related to COVID-19 on social media using deep learning<br />
The COVID-19 pandemic has become a public health emergency and a critical socioeconomic issue worldwide. It is changing the way we live and do business. Social media is a rich source of data about public opinion on different types of topics including topics about COVID-19. I plan on using Reddit to get a dataset of posts and comments from users related to COVID-19 and since Reddit is divided into communities so the posts and comments are also clustered by the topic of the community, for example, posts from the political subreddit will have posts about politics.<br />
<br />
I plan to make a classifier that will take a given text and will tell what the text of talking about for example it can be talking about politics, studies, relationships, etc. The goals of this project are to:<br />
<br />
• Scrape a dataset from Reddit from different communities<br />
<br />
• Train a deep learning model (CNN or RNN model) to classify a given text into the possible categories<br />
<br />
• Test the model on posts from social talking about COVID-19<br />
<br />
<br />
<br />
Project # 14 Group members<br />
<br />
Edwards, John<br />
<br />
Title: Click-through Rate Prediction Using Historical User Data<br />
<br />
Click-through Rate (CTR) prediction consists of forecasting a users probability of clicking on a specified target. CTR is used largely by online advertising systems which sell ad space on a cost-per-click pricing model to asses the likenesses of a user clicking on a targeted ad. <br />
<br />
User session logs provides firms with an assortment of individual specific features, a large - number of which are categorical. Additionally, advertisers posses multiple ad candidates each with their own respective features. The challenge of CTR prediction is to design a model which encompass the Interacting effects of these features to produced high quality forecasts and pair users with advertisements with high potential for click conversion. Additionally computational efficiency must balanced with model complexity so that predictions can be done in an online setting throughout the progression of a users session.<br />
<br />
This projects primary objective will be to attempt creating a new Deep Neural Network (DNN) architecture for producing high quality CTR forecasts while also satisfying the aforementioned challenges.<br />
<br />
While many variants of DNN for CTR predictions exists they can differ greatly in application setting. Specifically, the vast majority of models evaluate each user-ad interaction independently. They fail to utlise information contained for each specific users’ historical add impressions. There is only a small subset of models [1,2,4] which have tried to address this by adapting architectures to utilize historical information. This projects focus will be within this application setting exploring new architectures which can better utilise information contained within a users historical behaviour. <br />
<br />
This projects implementation will consist of the following action plan:<br />
Develop a new model architecture inspired by innovations of previous CTR network designs which lacked the ability to adapt their model to utlize a users historical data [4,5].<br />
Use the public benchmark Avito advertising dataset to empirically evaluate the new models performance and compare it against previous state of the art models for this data set. <br />
<br />
References:<br />
<br />
[1] Ouyang, Wentao & Zhang, Xiuwu & Ren, Shukui & Li, Li & Liu, Zhaojie & Du, Yanlong. (2019). Click-Through Rate Prediction with the User Memory Network. <br />
<br />
[2] Ouyang, Wentao & Zhang, Xiuwu & Li, Li & Zou, Heng & Xing, Xin & Liu, Zhaojie & Du, Yanlong. (2019). Deep Spatio-Temporal Neural Networks for Click-Through Rate Prediction. 2078-2086. 10.1145/3292500.3330655. <br />
<br />
[3] Ouyang, Wentao & Zhang, Xiuwu & Ren, Shukui & Qi, Chao & Liu, Zhaojie & Du, Yanlong. (2019). Representation Learning-Assisted Click-Through Rate Prediction. 4561-4567. 10.24963/ijcai.2019/634. <br />
<br />
[4] Li, Zeyu, Wei Cheng, Yang Chen, H. Chen and W. Wang. “Interpretable Click-Through Rate Prediction through Hierarchical Attention.” Proceedings of the 13th International Conference on Web Search and Data Mining (2020)<br />
<br />
[5] Zhou, Guorui & Gai, Kun & Zhu, Xiaoqiang & Song, Chenru & Fan, Ying & Zhu, Han & Ma, Xiao & Yan, Yanghui & Jin, Junqi & Li, Han. (2018). Deep Interest Network for Click-Through Rate Prediction. 1059-1068. 10.1145/3219819.3219823.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates&diff=44801Breaking Certified Defenses: Semantic Adversarial Examples With Spoofed Robustness Certificates2020-11-16T02:09:10Z<p>J32edwar: /* Introduction */</p>
<hr />
<div><br />
== Presented By ==<br />
Gaurav Sikri<br />
<br />
== Background ==<br />
<br />
Adversarial examples are inputs to machine learning or deep neural network models that an attacker intentionally designs to deceive the model or to cause the model to make a wrong prediction. This is done by adding a little noise to the original image or perturbing an original image and creating an image that is not identified by the network and therefore, the model misclassifies the new image. The following image describes an adversarial attack where a model is deceived by an attacker by adding a small noise to an input image and as a result, the prediction of the model changes.<br />
<br />
[[File:adversarial_example.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Adversarial Example </div><br />
<br />
The impacts of adversarial attacks can be life-threatening in the real world. Consider the case of driverless cars where the model installed in a car is trying to read a STOP sign on the road. However, if the STOP sign is replaced by an adversarial image of the original image, and if that new image is able to fool the model to not make a decision to stop, it can lead to an accident. Hence it becomes really important to design the classifiers such that these classifiers are immune to such adversarial attacks.<br />
<br />
While training a deep network, the network is trained on a set of augmented images along with the original images. For any given image, there are multiple augmented images created and passed to the network to ensure that a model is able to learn from the augmented images as well. During the validation phase, after labeling an image, the defenses check whether there exists an image of a different label within a region of a certain unit radius of the input. If the classifier assigns all images within the specified region ball the same class label, then a certificate is issued. This certificate ensures that the model is protected from adversarial attacks and is called Certified Defense. The image below shows a certified region (in red)<br />
<br />
[[File:certified_defense.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Certified Defense Illustration </div><br />
<br />
== Introduction ==<br />
Conventional deep learning models are generally highly sensitive to adversarial perturbations (Szegedy et al., 2013) in a way that natural-looking but minimally augmented images have been able to manipulate those models by causing misclassifications. While in the last few years, several defenses have been built that protect neural networks against such attacks (Madry et al., 2017; Shafahi et al., 2019), but these defenses are based on heuristics and tricks that are often easily breakable (Athalye et al. 2018). This has motivated a lot of researchers to work on certifiably secure networks — classifiers that produce a label for an inputted image in which the classification remains constant within a bounded set of perturbations around the original inputted image . Certified defenses have thus far considered <math>l_\text{p}</math>-bounded attacks where after labelling an input, if there does not exists an image resulting in a different label that is within the <math>l_\text{p}</math> norm ball of radius <math>\epsilon</math>, centred at the original input, then a certificate is issued. Most of the certified defenses created so far focus on deflecting <math>l_\text{p}</math>-bounded attacks where <math>p</math> = 2 or infinity.<br />
<br />
In this paper, the authors have demonstrated that a system that relies on certificates as a measure of label security can be exploited. The whole idea of the paper is to show that even though the system has a certified defense mechanism, it does not guarantee security against adversarial attacks. This is done by presenting a new class of adversarial examples that target not only the classifier output label but also the certificate. The first step is to add adversarial perturbations to images that are large in the <math>l_\text{p}</math>-norm (larger than the radius of the certificate region of the original image) and produce attack images that are outside the certificate boundary of the original image certificate and has images of the same (wrong) label. The result is a 'spoofed' certificate with a seemingly strong security guarantee despite being adversarially manipulated.<br />
<br />
The following three conditions should be met while creating adversarial examples:<br />
<br />
'''1. Imperceptibility: the adversarial image looks like the original example.<br />
<br />
'''2. Misclassification: the certified classifier assigns an incorrect label to the adversarial example.<br />
<br />
'''3. Strongly certified: the certified classifier provides a strong radius certificate for the adversarial example.<br />
<br />
The main focus of the paper is to attack the certificate of the model. The authors argue that the model can be attacked, no matter how strong the certificate of the model is.<br />
<br />
== Approach ==<br />
The approach used by the authors in this paper is 'Shadow Attack', which is a generalization of the well known PGD attack. The fundamental idea of the PGD attack is the same where a bunch of adversarial images are created in order to fool the network to make a wrong prediction. PGD attack solves the following optimization problem where <math>L</math> is the classification loss and the constraint corresponds to the minimal change done to the input image. For a recent review on adversarial attacks and more information of PGD attacks, see [1].<br />
<br />
\begin{align}<br />
max_{\delta }L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
s.t. \left \|\delta \right \|_{p} \leq \epsilon <br />
\end{align}<br />
<br />
Shadow attack on the other hand targets the certificate of the defenses by creating a new 'spoofed' certificate outside the certificate region of the input image. Shadow attack solves the following optimization problem where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.<br />
<br />
\begin{align}<br />
max_{\delta} L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> in the above equation corresponds to the color regularizer which makes sure that minimal changes are made to the color of the input image. <math>TV</math> corresponds to the Total Variation or smoothness parameter which makes sure that the smoothness of the newly created image is maintained. <math>Dissim</math> corresponds to the similarity parameter which makes sure that all the color channels (RGB) are changed equally.<br />
<br />
The perturbations created in the original images are - <br />
<br />
'''1. small<br />
<br />
'''2. smooth<br />
<br />
'''3. without dramatic color changes<br />
<br />
There are two ways to ensure that this dissimilarity will not happen or will be very low and the authors have shown that both of these methods are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} \approx \delta_{G,i} \approx \delta_{B,i} \forall i </math> i.e. for each pixel, the perturbations of all channels are equal and there will be <math> \delta_{ W \times H} </math>, where the size of the image is <math>3 \times W \times H</math> as the preturbation. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not equal and it uses <math> \delta_{3 \times W \times H} </math> with the <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.<br />
<br />
== Ablation Study of the Attack parameters==<br />
In order to determine the required number of SGD steps, the effect of <math> \lambda_s</math>, and the importance of <math> \lambda_s</math> on the each losses in the cost function, the authors have tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figure 4, 5, and 6, we can see that the <math>L(\delta)</math> (classification loss), <math>TV(\delta)</math> (Total Variation loss), <math>C(\delta)</math> (color regularizer) will converge to zero with 10 SGD steps. Note that since only 1-channel attack was used in this part of the experiment the <math>dissim(\delta)</math>was indeed zero. <br />
In figure 6 and 7, we can see the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors used two experiments to prove that their approach to attack a certified model was actually able to break those defenses. The datasets used for both of these experiments were CIFAR10 and ImageNet dataset.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.<br />
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also natural images (larger radii means a stronger/more confident certificate) </div><br />
<br />
The third and the fifth column correspond to the mean radius of the certified region of the original image and the mean radius of the spoof certificate of the perturbed images, respectively. It was observed that the mean radius of the certificate of adversarial images was greater than the mean radius of the original image certificate. This proves that the 'Shadow Attack' approach was successful in creating spoof certificates of greater radius and with the wrong label. This also proves that the approach used in the paper was successful in breaking the certified defenses.<br />
<br />
=== Attack on CROWN-IBP ===<br />
Crown IBP is an adversarial defense against <math>l_\text{inf}</math>-norm bounded attacks. The same approach was applied for the CROWN-IBP defense and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2 :''' “Robust error” for natural images, and “attack error” for Shadow Attack images using the<br />
CIFAR-10 dataset, and CROWN-IBP models. Smaller is better.) </div><br />
<br />
<br />
The above table shows the robustness errors in the case of the CROWN-IBP method and the attack images. It is seen that the errors in the case of the attack were less than the equivalent errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\text{inf}</math>-norm certified defenses as well.<br />
<br />
== Conclusion ==<br />
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques==<br />
<br />
It is noticeable in this paper that using the mathematical formulation of the defenses and certifications is considered a weak method, whereas the constraint is imposed by <math> l_{p} </math> as assumed in equation \eqref{eq:op}. The top models can not achieve certifications beyond <math> \epsilon = 0.3 </math> disturbance in <math> l_{2} </math> norm, while disturbances <math> \epsilon = 4 </math> added to the target input are barely noticeable by human eyes, and <math> \epsilon = 100 </math> , when applied to the original image are still easily classified by humans as belonging to the same class. As discussed by many authors, the perception of multi-dimensional space by human eyes goes beyond what the <math> l_{p} </math> norm is capable of capturing and synthesizing. It is yet to be proposed more comprehensive metrics and algorithms capable of capturing the correlation between pixels of an image or input data which can better translate to optimization algorithms how humans distinguish features of an input image. Such a metric would allow the optimization algorithms to have better intuition on the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
Christian Szegedy,Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.<br />
<br />
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.<br />
<br />
[1] Xu, H., Ma, Y., Liu, H. C., Deb, D., Liu, H., Tang, J. L., & Jain, A. K. (2020). Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. International Journal of Automation and Computing, 17(2), 151–178.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates&diff=44800Breaking Certified Defenses: Semantic Adversarial Examples With Spoofed Robustness Certificates2020-11-16T02:03:16Z<p>J32edwar: /* Introduction */</p>
<hr />
<div><br />
== Presented By ==<br />
Gaurav Sikri<br />
<br />
== Background ==<br />
<br />
Adversarial examples are inputs to machine learning or deep neural network models that an attacker intentionally designs to deceive the model or to cause the model to make a wrong prediction. This is done by adding a little noise to the original image or perturbing an original image and creating an image that is not identified by the network and therefore, the model misclassifies the new image. The following image describes an adversarial attack where a model is deceived by an attacker by adding a small noise to an input image and as a result, the prediction of the model changes.<br />
<br />
[[File:adversarial_example.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Adversarial Example </div><br />
<br />
The impacts of adversarial attacks can be life-threatening in the real world. Consider the case of driverless cars where the model installed in a car is trying to read a STOP sign on the road. However, if the STOP sign is replaced by an adversarial image of the original image, and if that new image is able to fool the model to not make a decision to stop, it can lead to an accident. Hence it becomes really important to design the classifiers such that these classifiers are immune to such adversarial attacks.<br />
<br />
While training a deep network, the network is trained on a set of augmented images along with the original images. For any given image, there are multiple augmented images created and passed to the network to ensure that a model is able to learn from the augmented images as well. During the validation phase, after labeling an image, the defenses check whether there exists an image of a different label within a region of a certain unit radius of the input. If the classifier assigns all images within the specified region ball the same class label, then a certificate is issued. This certificate ensures that the model is protected from adversarial attacks and is called Certified Defense. The image below shows a certified region (in red)<br />
<br />
[[File:certified_defense.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Certified Defense Illustration </div><br />
<br />
== Introduction ==<br />
Conventional deep learning models are generally highly sensitive to adversarial perturbations (Szegedy et al., 2013) in a way that natural-looking but minutely augmented images have been able to manipulate those models by causing misclassifications. While in the last few years, several defenses have been built that protect neural networks against such attacks (Madry et al., 2017; Shafahi et al., 2019), but the defenses are based on heuristics and tricks that are often easily breakable (Athalye et al. 2018). This has motivated a lot of researchers to work on certifiably secure networks — classifiers that produce a class label for an inputted image in which the classification remains constant within a bounded set of perturbations around the original inputted image . Certified defenses have thus far considered <math>l_\text{p}</math>-bounded attacks where after labelling an input, if there does not exists an image resulting in a different label that is within the <math>l_\text{p}</math> norm ball of radius <math>\epsilon</math>, centred at the original input, then a certificate is issued. Most of the certified defenses created so far focus on deflecting <math>l_\text{p}</math>-bounded attacks where <math>p</math> = 2 or infinity.<br />
<br />
In this paper, the authors have demonstrated that a system that relies on certificates as a measure of label security can be exploited. The whole idea of the paper is to show that even though the system has a certified defense mechanism, it does not guarantee security against adversarial attacks. This is done by presenting a new class of adversarial examples that target not only the classifier output label but also the certificate. The first step is to add adversarial perturbations to images that are large in the <math>l_\text{p}</math>-norm (larger than the radius of the certificate region of the original image) and produce attack images that are outside the certificate boundary of the original image certificate and has images of the same (wrong) label. The result is a 'spoofed' certificate with a seemingly strong security guarantee despite being adversarially manipulated.<br />
<br />
The following three conditions should be met while creating adversarial examples:<br />
<br />
'''1. Imperceptibility: the adversarial image looks like the original example.<br />
<br />
'''2. Misclassification: the certified classifier assigns an incorrect label to the adversarial example.<br />
<br />
'''3. Strongly certified: the certified classifier provides a strong radius certificate for the adversarial example.<br />
<br />
The main focus of the paper is to attack the certificate of the model. The authors argue that the model can be attacked, no matter how strong the certificate of the model is.<br />
<br />
== Approach ==<br />
The approach used by the authors in this paper is 'Shadow Attack', which is a generalization of the well known PGD attack. The fundamental idea of the PGD attack is the same where a bunch of adversarial images are created in order to fool the network to make a wrong prediction. PGD attack solves the following optimization problem where <math>L</math> is the classification loss and the constraint corresponds to the minimal change done to the input image. For a recent review on adversarial attacks and more information of PGD attacks, see [1].<br />
<br />
\begin{align}<br />
max_{\delta }L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
s.t. \left \|\delta \right \|_{p} \leq \epsilon <br />
\end{align}<br />
<br />
Shadow attack on the other hand targets the certificate of the defenses by creating a new 'spoofed' certificate outside the certificate region of the input image. Shadow attack solves the following optimization problem where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.<br />
<br />
\begin{align}<br />
max_{\delta} L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> in the above equation corresponds to the color regularizer which makes sure that minimal changes are made to the color of the input image. <math>TV</math> corresponds to the Total Variation or smoothness parameter which makes sure that the smoothness of the newly created image is maintained. <math>Dissim</math> corresponds to the similarity parameter which makes sure that all the color channels (RGB) are changed equally.<br />
<br />
The perturbations created in the original images are - <br />
<br />
'''1. small<br />
<br />
'''2. smooth<br />
<br />
'''3. without dramatic color changes<br />
<br />
There are two ways to ensure that this dissimilarity will not happen or will be very low and the authors have shown that both of these methods are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} \approx \delta_{G,i} \approx \delta_{B,i} \forall i </math> i.e. for each pixel, the perturbations of all channels are equal and there will be <math> \delta_{ W \times H} </math>, where the size of the image is <math>3 \times W \times H</math> as the preturbation. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not equal and it uses <math> \delta_{3 \times W \times H} </math> with the <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.<br />
<br />
== Ablation Study of the Attack parameters==<br />
In order to determine the required number of SGD steps, the effect of <math> \lambda_s</math>, and the importance of <math> \lambda_s</math> on the each losses in the cost function, the authors have tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figure 4, 5, and 6, we can see that the <math>L(\delta)</math> (classification loss), <math>TV(\delta)</math> (Total Variation loss), <math>C(\delta)</math> (color regularizer) will converge to zero with 10 SGD steps. Note that since only 1-channel attack was used in this part of the experiment the <math>dissim(\delta)</math>was indeed zero. <br />
In figure 6 and 7, we can see the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors used two experiments to prove that their approach to attack a certified model was actually able to break those defenses. The datasets used for both of these experiments were CIFAR10 and ImageNet dataset.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.<br />
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also natural images (larger radii means a stronger/more confident certificate) </div><br />
<br />
The third and the fifth column correspond to the mean radius of the certified region of the original image and the mean radius of the spoof certificate of the perturbed images, respectively. It was observed that the mean radius of the certificate of adversarial images was greater than the mean radius of the original image certificate. This proves that the 'Shadow Attack' approach was successful in creating spoof certificates of greater radius and with the wrong label. This also proves that the approach used in the paper was successful in breaking the certified defenses.<br />
<br />
=== Attack on CROWN-IBP ===<br />
Crown IBP is an adversarial defense against <math>l_\text{inf}</math>-norm bounded attacks. The same approach was applied for the CROWN-IBP defense and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2 :''' “Robust error” for natural images, and “attack error” for Shadow Attack images using the<br />
CIFAR-10 dataset, and CROWN-IBP models. Smaller is better.) </div><br />
<br />
<br />
The above table shows the robustness errors in the case of the CROWN-IBP method and the attack images. It is seen that the errors in the case of the attack were less than the equivalent errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\text{inf}</math>-norm certified defenses as well.<br />
<br />
== Conclusion ==<br />
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques==<br />
<br />
It is noticeable in this paper that using the mathematical formulation of the defenses and certifications is considered a weak method, whereas the constraint is imposed by <math> l_{p} </math> as assumed in equation \eqref{eq:op}. The top models can not achieve certifications beyond <math> \epsilon = 0.3 </math> disturbance in <math> l_{2} </math> norm, while disturbances <math> \epsilon = 4 </math> added to the target input are barely noticeable by human eyes, and <math> \epsilon = 100 </math> , when applied to the original image are still easily classified by humans as belonging to the same class. As discussed by many authors, the perception of multi-dimensional space by human eyes goes beyond what the <math> l_{p} </math> norm is capable of capturing and synthesizing. It is yet to be proposed more comprehensive metrics and algorithms capable of capturing the correlation between pixels of an image or input data which can better translate to optimization algorithms how humans distinguish features of an input image. Such a metric would allow the optimization algorithms to have better intuition on the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
Christian Szegedy,Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.<br />
<br />
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.<br />
<br />
[1] Xu, H., Ma, Y., Liu, H. C., Deb, D., Liu, H., Tang, J. L., & Jain, A. K. (2020). Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. International Journal of Automation and Computing, 17(2), 151–178.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates&diff=44799Breaking Certified Defenses: Semantic Adversarial Examples With Spoofed Robustness Certificates2020-11-16T02:02:19Z<p>J32edwar: /* Introduction */</p>
<hr />
<div><br />
== Presented By ==<br />
Gaurav Sikri<br />
<br />
== Background ==<br />
<br />
Adversarial examples are inputs to machine learning or deep neural network models that an attacker intentionally designs to deceive the model or to cause the model to make a wrong prediction. This is done by adding a little noise to the original image or perturbing an original image and creating an image that is not identified by the network and therefore, the model misclassifies the new image. The following image describes an adversarial attack where a model is deceived by an attacker by adding a small noise to an input image and as a result, the prediction of the model changes.<br />
<br />
[[File:adversarial_example.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Adversarial Example </div><br />
<br />
The impacts of adversarial attacks can be life-threatening in the real world. Consider the case of driverless cars where the model installed in a car is trying to read a STOP sign on the road. However, if the STOP sign is replaced by an adversarial image of the original image, and if that new image is able to fool the model to not make a decision to stop, it can lead to an accident. Hence it becomes really important to design the classifiers such that these classifiers are immune to such adversarial attacks.<br />
<br />
While training a deep network, the network is trained on a set of augmented images along with the original images. For any given image, there are multiple augmented images created and passed to the network to ensure that a model is able to learn from the augmented images as well. During the validation phase, after labeling an image, the defenses check whether there exists an image of a different label within a region of a certain unit radius of the input. If the classifier assigns all images within the specified region ball the same class label, then a certificate is issued. This certificate ensures that the model is protected from adversarial attacks and is called Certified Defense. The image below shows a certified region (in red)<br />
<br />
[[File:certified_defense.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Certified Defense Illustration </div><br />
<br />
== Introduction ==<br />
Conventional deep learning models are generally highly sensitive to adversarial perturbations (Szegedy et al., 2013) in a way that natural-looking but minutely augmented images have been able to manipulate those models by causing misclassifications. While in the last few years, several defenses have been built that protect neural networks against such attacks (Madry et al., 2017; Shafahi et al., 2019), but the defenses are based on heuristics and tricks that are often easily breakable (Athalye et al. 2018). This has motivated a lot of researchers to work on certifiably secure networks — classifiers that produce a class label for an inputted image in which the classification remains constant within a bounded set of perturbations around the original inputted image . Certified defenses have thus far considered <math>l_\text{p}</math>-bounded attacks where after labelling an input, if there does not exists an image resulting in a different label that is within the <math>l_\text{p}</math> norm ball of radius <math>\epsilon</math>, centred at the original input, then a certificate is issued. Most of the certified defenses created so far focus on deflecting <math>l_\text{p}</math>-bounded attacks where <math>p</math> = 2 or infinity.<br />
In this paper, the authors have demonstrated that a system that relies on certificates as a measure of label security can be exploited. The whole idea of the paper is to show that even though the system has a certified defense mechanism, it does not guarantee security against adversarial attacks. This is done by presenting a new class of adversarial examples that target not only the classifier output label but also the certificate. The first step is to add adversarial perturbations to images that are large in the <math>l_\text{p}</math>-norm (larger than the radius of the certificate region of the original image) and produce attack images that are outside the certificate boundary of the original image certificate and has images of the same (wrong) label. The result is a 'spoofed' certificate with a seemingly strong security guarantee despite being adversarially manipulated.<br />
<br />
The following three conditions should be met while creating adversarial examples:<br />
<br />
'''1. Imperceptibility: the adversarial image looks like the original example.<br />
<br />
'''2. Misclassification: the certified classifier assigns an incorrect label to the adversarial example.<br />
<br />
'''3. Strongly certified: the certified classifier provides a strong radius certificate for the adversarial example.<br />
<br />
The main focus of the paper is to attack the certificate of the model. The authors argue that the model can be attacked, no matter how strong the certificate of the model is.<br />
<br />
== Approach ==<br />
The approach used by the authors in this paper is 'Shadow Attack', which is a generalization of the well known PGD attack. The fundamental idea of the PGD attack is the same where a bunch of adversarial images are created in order to fool the network to make a wrong prediction. PGD attack solves the following optimization problem where <math>L</math> is the classification loss and the constraint corresponds to the minimal change done to the input image. For a recent review on adversarial attacks and more information of PGD attacks, see [1].<br />
<br />
\begin{align}<br />
max_{\delta }L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
s.t. \left \|\delta \right \|_{p} \leq \epsilon <br />
\end{align}<br />
<br />
Shadow attack on the other hand targets the certificate of the defenses by creating a new 'spoofed' certificate outside the certificate region of the input image. Shadow attack solves the following optimization problem where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.<br />
<br />
\begin{align}<br />
max_{\delta} L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> in the above equation corresponds to the color regularizer which makes sure that minimal changes are made to the color of the input image. <math>TV</math> corresponds to the Total Variation or smoothness parameter which makes sure that the smoothness of the newly created image is maintained. <math>Dissim</math> corresponds to the similarity parameter which makes sure that all the color channels (RGB) are changed equally.<br />
<br />
The perturbations created in the original images are - <br />
<br />
'''1. small<br />
<br />
'''2. smooth<br />
<br />
'''3. without dramatic color changes<br />
<br />
There are two ways to ensure that this dissimilarity will not happen or will be very low and the authors have shown that both of these methods are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} \approx \delta_{G,i} \approx \delta_{B,i} \forall i </math> i.e. for each pixel, the perturbations of all channels are equal and there will be <math> \delta_{ W \times H} </math>, where the size of the image is <math>3 \times W \times H</math> as the preturbation. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not equal and it uses <math> \delta_{3 \times W \times H} </math> with the <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.<br />
<br />
== Ablation Study of the Attack parameters==<br />
In order to determine the required number of SGD steps, the effect of <math> \lambda_s</math>, and the importance of <math> \lambda_s</math> on the each losses in the cost function, the authors have tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figure 4, 5, and 6, we can see that the <math>L(\delta)</math> (classification loss), <math>TV(\delta)</math> (Total Variation loss), <math>C(\delta)</math> (color regularizer) will converge to zero with 10 SGD steps. Note that since only 1-channel attack was used in this part of the experiment the <math>dissim(\delta)</math>was indeed zero. <br />
In figure 6 and 7, we can see the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors used two experiments to prove that their approach to attack a certified model was actually able to break those defenses. The datasets used for both of these experiments were CIFAR10 and ImageNet dataset.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.<br />
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also natural images (larger radii means a stronger/more confident certificate) </div><br />
<br />
The third and the fifth column correspond to the mean radius of the certified region of the original image and the mean radius of the spoof certificate of the perturbed images, respectively. It was observed that the mean radius of the certificate of adversarial images was greater than the mean radius of the original image certificate. This proves that the 'Shadow Attack' approach was successful in creating spoof certificates of greater radius and with the wrong label. This also proves that the approach used in the paper was successful in breaking the certified defenses.<br />
<br />
=== Attack on CROWN-IBP ===<br />
Crown IBP is an adversarial defense against <math>l_\text{inf}</math>-norm bounded attacks. The same approach was applied for the CROWN-IBP defense and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2 :''' “Robust error” for natural images, and “attack error” for Shadow Attack images using the<br />
CIFAR-10 dataset, and CROWN-IBP models. Smaller is better.) </div><br />
<br />
<br />
The above table shows the robustness errors in the case of the CROWN-IBP method and the attack images. It is seen that the errors in the case of the attack were less than the equivalent errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\text{inf}</math>-norm certified defenses as well.<br />
<br />
== Conclusion ==<br />
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques==<br />
<br />
It is noticeable in this paper that using the mathematical formulation of the defenses and certifications is considered a weak method, whereas the constraint is imposed by <math> l_{p} </math> as assumed in equation \eqref{eq:op}. The top models can not achieve certifications beyond <math> \epsilon = 0.3 </math> disturbance in <math> l_{2} </math> norm, while disturbances <math> \epsilon = 4 </math> added to the target input are barely noticeable by human eyes, and <math> \epsilon = 100 </math> , when applied to the original image are still easily classified by humans as belonging to the same class. As discussed by many authors, the perception of multi-dimensional space by human eyes goes beyond what the <math> l_{p} </math> norm is capable of capturing and synthesizing. It is yet to be proposed more comprehensive metrics and algorithms capable of capturing the correlation between pixels of an image or input data which can better translate to optimization algorithms how humans distinguish features of an input image. Such a metric would allow the optimization algorithms to have better intuition on the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
Christian Szegedy,Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.<br />
<br />
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.<br />
<br />
[1] Xu, H., Ma, Y., Liu, H. C., Deb, D., Liu, H., Tang, J. L., & Jain, A. K. (2020). Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. International Journal of Automation and Computing, 17(2), 151–178.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates&diff=44798Breaking Certified Defenses: Semantic Adversarial Examples With Spoofed Robustness Certificates2020-11-16T01:57:41Z<p>J32edwar: /* Introduction */</p>
<hr />
<div><br />
== Presented By ==<br />
Gaurav Sikri<br />
<br />
== Background ==<br />
<br />
Adversarial examples are inputs to machine learning or deep neural network models that an attacker intentionally designs to deceive the model or to cause the model to make a wrong prediction. This is done by adding a little noise to the original image or perturbing an original image and creating an image that is not identified by the network and therefore, the model misclassifies the new image. The following image describes an adversarial attack where a model is deceived by an attacker by adding a small noise to an input image and as a result, the prediction of the model changes.<br />
<br />
[[File:adversarial_example.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Adversarial Example </div><br />
<br />
The impacts of adversarial attacks can be life-threatening in the real world. Consider the case of driverless cars where the model installed in a car is trying to read a STOP sign on the road. However, if the STOP sign is replaced by an adversarial image of the original image, and if that new image is able to fool the model to not make a decision to stop, it can lead to an accident. Hence it becomes really important to design the classifiers such that these classifiers are immune to such adversarial attacks.<br />
<br />
While training a deep network, the network is trained on a set of augmented images along with the original images. For any given image, there are multiple augmented images created and passed to the network to ensure that a model is able to learn from the augmented images as well. During the validation phase, after labeling an image, the defenses check whether there exists an image of a different label within a region of a certain unit radius of the input. If the classifier assigns all images within the specified region ball the same class label, then a certificate is issued. This certificate ensures that the model is protected from adversarial attacks and is called Certified Defense. The image below shows a certified region (in red)<br />
<br />
[[File:certified_defense.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Certified Defense Illustration </div><br />
<br />
== Introduction ==<br />
Conventional deep learning models are generally highly sensitive to adversarial perturbations (Szegedy et al., 2013) in a way that natural-looking but minutely augmented images have been able to manipulate those models by causing misclassifications. While in the last few years, several defenses have been built that protect neural networks against such attacks (Madry et al., 2017; Shafahi et al., 2019), but the defenses are based on heuristics and tricks that are often easily breakable (Athalye et al. 2018). This has motivated a lot of researchers to work on certifiably secure networks — classifiers that produce a class label for an inputted image in which the classification remains constant within a bounded set of perturbations around the original inputted image . Certified defenses have thus far focus on deflecting <math>l_\text{p}</math>-bounded attacks where after labelling an input, if there does not exists an image resulting in a different label that is within the <math>l_\text{p}</math> norm ball of radius <math>\epsilon</math>, centered at the original input, then a certificate is issued. The authors note that to date all work on certifiable defences have focused on <math>p</math> = 2 or infinity (Cohen et al., 2019; Gowal et al., 2018; Wong et al., 2018).<br />
<br />
In this paper, the authors have demonstrated that a system that relies on certificates as a measure of label security can be exploited. The whole idea of the paper is to show that even though the system has a certified defense mechanism, it does not guarantee security against adversarial attacks. This is done by presenting a new class of adversarial examples that target not only the classifier output label but also the certificate. The first step is to add adversarial perturbations to images that are large in the <math>l_\text{p}</math>-norm (larger than the radius of the certificate region of the original image) and produce attack images that are outside the certificate boundary of the original image certificate and has images of the same (wrong) label. The result is a 'spoofed' certificate with a seemingly strong security guarantee despite being adversarially manipulated.<br />
<br />
The following three conditions should be met while creating adversarial examples:<br />
<br />
'''1. Imperceptibility: the adversarial image looks like the original example.<br />
<br />
'''2. Misclassification: the certified classifier assigns an incorrect label to the adversarial example.<br />
<br />
'''3. Strongly certified: the certified classifier provides a strong radius certificate for the adversarial example.<br />
<br />
The main focus of the paper is to attack the certificate of the model. The authors argue that the model can be attacked, no matter how strong the certificate of the model is.<br />
<br />
== Approach ==<br />
The approach used by the authors in this paper is 'Shadow Attack', which is a generalization of the well known PGD attack. The fundamental idea of the PGD attack is the same where a bunch of adversarial images are created in order to fool the network to make a wrong prediction. PGD attack solves the following optimization problem where <math>L</math> is the classification loss and the constraint corresponds to the minimal change done to the input image. For a recent review on adversarial attacks and more information of PGD attacks, see [1].<br />
<br />
\begin{align}<br />
max_{\delta }L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
s.t. \left \|\delta \right \|_{p} \leq \epsilon <br />
\end{align}<br />
<br />
Shadow attack on the other hand targets the certificate of the defenses by creating a new 'spoofed' certificate outside the certificate region of the input image. Shadow attack solves the following optimization problem where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.<br />
<br />
\begin{align}<br />
max_{\delta} L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> in the above equation corresponds to the color regularizer which makes sure that minimal changes are made to the color of the input image. <math>TV</math> corresponds to the Total Variation or smoothness parameter which makes sure that the smoothness of the newly created image is maintained. <math>Dissim</math> corresponds to the similarity parameter which makes sure that all the color channels (RGB) are changed equally.<br />
<br />
The perturbations created in the original images are - <br />
<br />
'''1. small<br />
<br />
'''2. smooth<br />
<br />
'''3. without dramatic color changes<br />
<br />
There are two ways to ensure that this dissimilarity will not happen or will be very low and the authors have shown that both of these methods are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} \approx \delta_{G,i} \approx \delta_{B,i} \forall i </math> i.e. for each pixel, the perturbations of all channels are equal and there will be <math> \delta_{ W \times H} </math>, where the size of the image is <math>3 \times W \times H</math> as the preturbation. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not equal and it uses <math> \delta_{3 \times W \times H} </math> with the <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.<br />
<br />
== Ablation Study of the Attack parameters==<br />
In order to determine the required number of SGD steps, the effect of <math> \lambda_s</math>, and the importance of <math> \lambda_s</math> on the each losses in the cost function, the authors have tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figure 4, 5, and 6, we can see that the <math>L(\delta)</math> (classification loss), <math>TV(\delta)</math> (Total Variation loss), <math>C(\delta)</math> (color regularizer) will converge to zero with 10 SGD steps. Note that since only 1-channel attack was used in this part of the experiment the <math>dissim(\delta)</math>was indeed zero. <br />
In figure 6 and 7, we can see the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors used two experiments to prove that their approach to attack a certified model was actually able to break those defenses. The datasets used for both of these experiments were CIFAR10 and ImageNet dataset.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.<br />
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also natural images (larger radii means a stronger/more confident certificate) </div><br />
<br />
The third and the fifth column correspond to the mean radius of the certified region of the original image and the mean radius of the spoof certificate of the perturbed images, respectively. It was observed that the mean radius of the certificate of adversarial images was greater than the mean radius of the original image certificate. This proves that the 'Shadow Attack' approach was successful in creating spoof certificates of greater radius and with the wrong label. This also proves that the approach used in the paper was successful in breaking the certified defenses.<br />
<br />
=== Attack on CROWN-IBP ===<br />
Crown IBP is an adversarial defense against <math>l_\text{inf}</math>-norm bounded attacks. The same approach was applied for the CROWN-IBP defense and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2 :''' “Robust error” for natural images, and “attack error” for Shadow Attack images using the<br />
CIFAR-10 dataset, and CROWN-IBP models. Smaller is better.) </div><br />
<br />
<br />
The above table shows the robustness errors in the case of the CROWN-IBP method and the attack images. It is seen that the errors in the case of the attack were less than the equivalent errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\text{inf}</math>-norm certified defenses as well.<br />
<br />
== Conclusion ==<br />
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques==<br />
<br />
It is noticeable in this paper that using the mathematical formulation of the defenses and certifications is considered a weak method, whereas the constraint is imposed by <math> l_{p} </math> as assumed in equation \eqref{eq:op}. The top models can not achieve certifications beyond <math> \epsilon = 0.3 </math> disturbance in <math> l_{2} </math> norm, while disturbances <math> \epsilon = 4 </math> added to the target input are barely noticeable by human eyes, and <math> \epsilon = 100 </math> , when applied to the original image are still easily classified by humans as belonging to the same class. As discussed by many authors, the perception of multi-dimensional space by human eyes goes beyond what the <math> l_{p} </math> norm is capable of capturing and synthesizing. It is yet to be proposed more comprehensive metrics and algorithms capable of capturing the correlation between pixels of an image or input data which can better translate to optimization algorithms how humans distinguish features of an input image. Such a metric would allow the optimization algorithms to have better intuition on the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
Christian Szegedy,Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.<br />
<br />
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.<br />
<br />
[1] Xu, H., Ma, Y., Liu, H. C., Deb, D., Liu, H., Tang, J. L., & Jain, A. K. (2020). Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. International Journal of Automation and Computing, 17(2), 151–178.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates&diff=44710Breaking Certified Defenses: Semantic Adversarial Examples With Spoofed Robustness Certificates2020-11-15T22:26:28Z<p>J32edwar: /* Approach */</p>
<hr />
<div><br />
== Presented By ==<br />
Gaurav Sikri<br />
<br />
== Background ==<br />
<br />
Adversarial examples are inputs to machine learning or deep neural network models that an attacker intentionally designs to deceive the model or to cause the model to make a wrong prediction. This is done by adding a little noise to the original image or perturbing an original image and creating an image that is not identified by the network and therefore, the model misclassifies the new image. The following image describes an adversarial attack where a model is deceived by an attacker by adding a small noise to an input image and as a result, the prediction of the model changes.<br />
<br />
[[File:adversarial_example.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Adversarial Example </div><br />
<br />
The impacts of adversarial attacks can be life-threatening in the real world. Consider the case of driverless cars where the model installed in a car is trying to read a STOP sign on the road. However, if the STOP sign is replaced by an adversarial image of the original image, and if that new image is able to fool the model to not make a decision to stop, it can lead to an accident. Hence it becomes really important to design the classifiers such that these classifiers are immune to such adversarial attacks.<br />
<br />
While training a deep network, the network is trained on a set of augmented images along with the original images. For any given image, there are multiple augmented images created and passed to the network to ensure that a model is able to learn from the augmented images as well. During the validation phase, after labeling an image, the defenses check whether there exists an image of a different label within a region of a certain unit radius of the input. If the classifier assigns all images within the specified region ball the same class label, then a certificate is issued. This certificate ensures that the model is protected from adversarial attacks and is called Certified Defense. The image below shows a certified region (in red)<br />
<br />
[[File:certified_defense.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Certified Defense Illustration </div><br />
<br />
== Introduction ==<br />
Conventional deep learning models are generally highly sensitive to adversarial perturbations (Szegedy et al., 2013) in a way that natural-looking but minutely augmented images have been able to manipulate those models by causing misclassifications. While in the last few years, several defenses have been build that protects neural networks against such attacks (Madry et al., 2017; Shafahi et al., 2019), but the defenses are based on heuristics and tricks are often easily breakable (Athalye et al. 2018). This has motivated a lot of researchers to work on certifiably secure networks — classifiers that produce a label for an image, and at the same time guarantees that the input is not adversarially manipulated. Most of the certified defenses created so far focus on deflecting <math>l_\text{p}</math>-bounded attacks where <math>p</math> = 2 or infinity.<br />
<br />
In this paper, the authors have demonstrated that a system that relies on certificates as a measure of label security can be exploited. The whole idea of the paper is to show that even though the system has a certified defense mechanism, it does not guarantee security against adversarial attacks. This is done by presenting a new class of adversarial examples that target not only the classifier output label but also the certificate. The first step is to add adversarial perturbations to images that are large in the <math>l_\text{p}</math>-norm (larger than the radius of the certificate region of the original image) and produce attack images that are outside the certificate boundary of the original image certificate and has images of the same (wrong) label. The result is a 'spoofed' certificate with a seemingly strong security guarantee despite being adversarially manipulated.<br />
<br />
The following three conditions should be met while creating adversarial examples:<br />
<br />
'''1. Imperceptibility: the adversarial image looks like the original example.<br />
<br />
'''2. Misclassification: the certified classifier assigns an incorrect label to the adversarial example.<br />
<br />
'''3. Strongly certified: the certified classifier provides a strong radius certificate for the adversarial example.<br />
<br />
The main focus of the paper is to attack the certificate of the model. The authors argue that the model can be attacked, no matter how strong the certificate of the model is.<br />
<br />
== Approach ==<br />
The approach used by the authors in this paper is 'Shadow Attack', which is a generalization of the well known PGD attack. The fundamental idea of the PGD attack is the same where a bunch of adversarial images are created in order to fool the network to make a wrong prediction. PGD attack solves the following optimization problem where <math>L</math> is the classification loss and the constraint corresponds to the minimal change done to the input image.<br />
<br />
\begin{align}<br />
max_{\delta }L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
s.t. \left \|\delta \right \|_{p} \leq \epsilon <br />
\end{align}<br />
<br />
Shadow attack on the other hand targets the certificate of the defenses by creating a new 'spoofed' certificate outside the certificate region of the input image. Shadow attack solves the following optimization problem where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.<br />
<br />
\begin{align}<br />
max_{\delta} L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> in the above equation corresponds to the color regularizer which makes sure that minimal changes are made to the color of the input image. <math>TV</math> corresponds to the Total Variation or smoothness parameter which makes sure that the smoothness of the newly created image is maintained. <math>Dissim</math> corresponds to the similarity parameter which makes sure that all the color channels (RGB) are changed equally.<br />
<br />
The perturbations created in the original images are - <br />
<br />
'''1. small<br />
<br />
'''2. smooth<br />
<br />
'''3. without dramatic color changes<br />
<br />
There are two ways to ensure that this dissimilarity will not happen or will be very low and the authors have shown that both of these methods are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} \approx \delta_{G,i} \approx \delta_{B,i} \forall i </math> i.e. for each pixel, the perturbations of all channels are equal and there will be <math> \delta_{ W \times H} </math>, where the size of the image is <math>3 \times W \times H</math> as the preturbation. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not equal and it uses <math> \delta_{3 \times W \times H} </math> with the <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.<br />
<br />
== Ablation Study of the Attack parameters==<br />
In order to determine the required number of SGD steps, the effect of <math> \lambda_s</math>, and the importance of <math> \lambda_s</math> on the each losses in the cost function, the authors have tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figure 4, 5, and 6, we can see that the <math>L(\delta)</math> (classification loss), <math>TV(\delta)</math> (Total Variation loss), <math>C(\delta)</math> (color regularizer) will converge to zero with 10 SGD steps. Note that since only 1-channel attack was used in this part of the experiment the <math>dissim(\delta)</math>was indeed zero. <br />
In figure 6 and 7, we can see the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors used two experiments to prove that their approach to attack a certified model was actually able to break those defenses. The datasets used for both of these experiments were CIFAR10 and ImageNet dataset.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.<br />
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also natural images (larger radii means a stronger/more confident certificate) </div><br />
<br />
The third and the fifth column correspond to the mean radius of the certified region of the original image and the mean radius of the spoof certificate of the perturbed images, respectively. It was observed that the mean radius of the certificate of adversarial images was greater than the mean radius of the original image certificate. This proves that the 'Shadow Attack' approach was successful in creating spoof certificates of greater radius and with the wrong label. This also proves that the approach used in the paper was successful in breaking the certified defenses.<br />
<br />
=== Attack on CROWN-IBP ===<br />
Crown IBP is an adversarial defense against <math>l_\text{inf}</math>-norm bounded attacks. The same approach was applied for the CROWN-IBP defense and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2 :''' “Robust error” for natural images, and “attack error” for Shadow Attack images using the<br />
CIFAR-10 dataset, and CROWN-IBP models. Smaller is better.) </div><br />
<br />
<br />
The above table shows the robustness errors in the case of the CROWN-IBP method and the attack images. It is seen that the errors in the case of the attack were less than the equivalent errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\text{inf}</math>-norm certified defenses as well.<br />
<br />
== Conclusion ==<br />
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques==<br />
<br />
It is noticeable in this paper that using the mathematical formulation of the defenses and certifications is considered a weak method, whereas the constraint is imposed by <math> l_{p} </math> as assumed in equation \eqref{eq:op}. The top models can not achieve certifications beyond <math> \epsilon = 0.3 </math> disturbance in <math> l_{2} </math> norm, while disturbances <math> \epsilon = 4 </math> added to the target input are barely noticeable by human eyes, and <math> \epsilon = 100 </math> , when applied to the original image are still easily classified by humans as belonging to the same class. As discussed by many authors, the perception of multi-dimensional space by human eyes goes beyond what the <math> l_{p} </math> norm is capable of capturing and synthesizing. It is yet to be proposed more comprehensive metrics and algorithms capable of capturing the correlation between pixels of an image or input data which can better translate to optimization algorithms how humans distinguish features of an input image. Such a metric would allow the optimization algorithms to have better intuition on the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
Christian Szegedy,Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.<br />
<br />
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=44697When Does Self-Supervision Improve Few-Shot Learning?2020-11-15T21:54:53Z<p>J32edwar: /* Method */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper proposes a technique utilizing self-supervised learning (SSL) to improve the generalization of few-shot learned representations on small labelled data sets . <br />
<br />
Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. <br />
<br />
Self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method can help aid against generalization issues where the agent cannot distinguish the difference between newly introduced objects.<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many few-shot learning methods currently exist, among which is this paper which focuses on Prototypical Networks or ProtoNets[1] for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML)[2].<br />
<br />
The other machine learning technique that this paper is based on is self-supervised learning. In this technique unlabelled data is utilized which can avoid incurring the computational expenses of labelling and maintaining a massive data set . Images already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.<br />
<br />
The work in this paper is also related to multi-task learning. In multi-task learning training proceeds on multiple tasks concurrently to improve each other. Training on multiple tasks is known to decline the performance on individual tasks[3] and this seems to work only for very specific combinations and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning are mutually beneficial to each other and this has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning.<br />
<br />
In this a feed-forward convolutional network <math>f(x)</math> maps either a labelled image or an augmented unlabelled image to an embedding space. Depending on the input type the embedding is then mapped to one of two label spaces by either a classifier <math>g</math> or a function <math>h</math>. When evaluating the accuracy of the model only the mappings of labelled images by the classifier<math>g</math> will be considered. Whereas when training the model both mappings of labelled and unlabelled images by <math>g</math> and <math>h</math> respectively will be utilized. <br />
The labelled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the unlabelled images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. Within this domain augmentations will have be applied to the images. The authors consider the augmentation types of jigsaw puzzle and rotation.They also compare the effects on accuracy of having the unlabelled image be an augmentation of the inputted labelled image (i.e <math>\mathcal{D}_s = \mathcal{D}_{ss}</math>) versus having the unlabelled image be an augmentation of a different image (i.e <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>). <br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Combining supervised and self-supervised losses for few-shot learning. . This paper investigates how the performance on the supervised learning task is influenced by the the choice of the self-supervision task.</div><br />
<br />
The training procedure consists of mapping a labelled image and an unlabelled augmented image to separate embeddings using the shared feature backbone of the feed-forward convolutional network <math>f</math>. It is then trained using an loss function <math>\mathcal{L}</math> which combines a classification loss term <math>\mathcal{L}_s</math> involving the labelled image embedding and a self-supervised losses term <math>\mathcal{L}_{ss}</math> involving the unlabelled augmented image embedding.<br />
<br />
The classification loss <math>\mathcal{L}_s</math> is defined as:<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
Where it is common to use cross-entropy loss for the loss function, <math> \ell </math>, and <math> \ell_2 </math> norm for the regularization, <math> \mathcal{R} </math>.<br />
<br />
The task prediction loss <math>\mathcal{L}_{ss}</math> utilizes a separate function <math>h</math> which maps the embeddings of unlabelled images to a separate label space. Here a target label <math>\hat{y}</math> will be related to the augmentation that was applied to the unlabelled image. In the case of jigsaw the label will be the indexes of the permutations applied to the original image. In the case of a rotation the label will be the angle of rotation applied to the original image. If we define a set of labelled pairs for the previously unlabelled augmented imaged as, <math> \forall x \in \mathcal{D}_{ss}, x \rightarrow (\hat{x}, \hat{y}) </math>, where <math>\hat{x}</math> is the identity mapping of <math>x</math>, then the task prediction loss can then be defined as:<br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
<br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that for the case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.<br />
<br />
== Experiments ==<br />
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class as shown in Figure 2. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
[[File:1.png |center|]]<br />
<br />
<div align="center">Figure 2: Used datasets and their base, validation and test splits.</div><br />
<br />
The authors used a meta-learning method based on prototypical networks where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based learners and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle[4]. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.<br />
<br />
== Results ==<br />
The results on 5-way 5-shot classification accuracy can be seen in Fig. 3. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.<br />
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 4, SSL is found to be more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 5 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: Performance on few-shot learning using different meta-learners.</div><br />
<br />
In Fig. 6 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 6(a) shows the changes in the accuracy based on increasing the percentage of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 6(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often better than increasing the size, but shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
<br />
Figure 7 shows the accuracy of the meta-learner with SSL on different domains as function of distance between the supervised domain Ds and the self-supervised domain Dss. Once again we see that the effectiveness of SSL decreases with the distance from the supervised domain across all datasets.<br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 7: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that Self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels.<br />
<br />
== Conclusion ==<br />
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also showed that the dataset used for SSL should not necessarily be large. Increasing the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains, while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
Moreover, in fig. 8 the authors considered the same domain learning for different examples and they indicated that adding more unlabeled data of the base classes will increase the accuracy. I would be really curious to apply their approach using cross-domain learning where the base and novel classes come from very different domains. I believe it might add some robustness and take accuracy to a different level. Also, comparing the cross-domain with the same-domain learning might add value to their point when they clued that there is no much improvement in the rotation task especially in the flowers example as it is mostly symmetrical. <br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 8: Comparison with prior works on mini_ImageNet.</div><br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=44677When Does Self-Supervision Improve Few-Shot Learning?2020-11-15T21:08:01Z<p>J32edwar: /* Method */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper proposes a technique utilizing self-supervised learning (SSL) to improve the generalization of few-shot learned representations on small labelled data sets . <br />
<br />
Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. <br />
<br />
Self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method can help aid against generalization issues where the agent cannot distinguish the difference between newly introduced objects.<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many few-shot learning methods currently exist, among which is this paper which focuses on Prototypical Networks or ProtoNets[1] for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML)[2].<br />
<br />
The other machine learning technique that this paper is based on is self-supervised learning. In this technique unlabelled data is utilized which can avoid incurring the computational expenses of labelling and maintaining a massive data set . Images already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.<br />
<br />
The work in this paper is also related to multi-task learning. In multi-task learning training proceeds on multiple tasks concurrently to improve each other. Training on multiple tasks is known to decline the performance on individual tasks[3] and this seems to work only for very specific combinations and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning are mutually beneficial to each other and this has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning.<br />
<br />
In this a feed-forward convolutional network <math>f(x)</math> maps either a labelled image or an augmented unlabelled image to an embedding space. Depending on the input type the embedding is then mapped to one of two label spaces by either a classifier <math>g</math> or a function <math>h</math>. <br />
The labelled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the unlabelled images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. Within this domain augmentations will have be applied to the images. The authors consider the augmentation types of jigsaw puzzle and rotation.They also compare the effects on accuracy of having the unlabelled image be an augmentation of the inputted labelled image (i.e <math>\mathcal{D}_s = \mathcal{D}_{ss}</math>) versus having the unlabelled image be an augmentation of a different image (i.e <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>). <br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Combining supervised and self-supervised losses for few-shot learning. . This paper investigates how the performance on the supervised learning task is influenced by the the choice of the self-supervision task.</div><br />
<br />
The training procedure consists of mapping a labelled image and an unlabelled augmented image to separate embeddings using the shared feature backbone of the feed-forward convolutional network <math>f</math>. It is then trained using an loss function <math>\mathcal{L}</math> which combines a classification loss term <math>\mathcal{L}_s</math> involving the labelled image embedding and a self-supervised losses term <math>\mathcal{L}_{ss}</math> involving the unlabelled augmented image embedding.<br />
<br />
The classification loss <math>\mathcal{L}_s</math> is defined as:<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
Where it is common to use cross-entropy loss for the loss function, <math> \ell </math>, and <math> \ell_2 </math> norm for the regularization, <math> \mathcal{R} </math>.<br />
<br />
The task prediction loss <math>\mathcal{L}_{ss}</math> utilizes a separate function <math>h</math> which maps the embeddings of unlabelled images to a separate label space. Here a target label <math>\hat{y}</math> will be related to the augmentation that was applied to the unlabelled image. In the case of jigsaw the label will be the indexes of the permutations applied to the original image. In the case of a rotation the label will be the angle of rotation applied to the original image. If we define a set of labelled pairs for the previously unlabelled augmented imaged as, <math> \forall x \in \mathcal{D}_{ss}, x \rightarrow (\hat{x}, \hat{y}) </math>, where <math>\hat{x}</math> is the identity mapping of <math>x</math>, then the task prediction loss can then be defined as:<br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
<br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that for the case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.<br />
<br />
== Experiments ==<br />
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class as shown in Figure 2. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
[[File:1.png |center|]]<br />
<br />
<div align="center">Figure 2: Used datasets and their base, validation and test splits.</div><br />
<br />
The authors used a meta-learning method based on prototypical networks where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based learners and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle[4]. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.<br />
<br />
== Results ==<br />
The results on 5-way 5-shot classification accuracy can be seen in Fig. 3. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.<br />
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 4, SSL is found to be more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 5 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: Performance on few-shot learning using different meta-learners.</div><br />
<br />
In Fig. 6 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 6(a) shows the changes in the accuracy based on increasing the percentage of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 6(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often better than increasing the size, but shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
<br />
Figure 7 shows the accuracy of the meta-learner with SSL on different domains as function of distance between the supervised domain Ds and the self-supervised domain Dss. Once again we see that the effectiveness of SSL decreases with the distance from the supervised domain across all datasets.<br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 7: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that Self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels.<br />
<br />
== Conclusion ==<br />
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also showed that the dataset used for SSL should not necessarily be large. Increasing the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains, while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
Moreover, in fig. 8 the authors considered the same domain learning for different examples and they indicated that adding more unlabeled data of the base classes will increase the accuracy. I would be really curious to apply their approach using cross-domain learning where the base and novel classes come from very different domains. I believe it might add some robustness and take accuracy to a different level. Also, comparing the cross-domain with the same-domain learning might add value to their point when they clued that there is no much improvement in the rotation task especially in the flowers example as it is mostly symmetrical. <br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 8: Comparison with prior works on mini_ImageNet.</div><br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=44674When Does Self-Supervision Improve Few-Shot Learning?2020-11-15T21:03:26Z<p>J32edwar: /* Method */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper proposes a technique utilizing self-supervised learning (SSL) to improve the generalization of few-shot learned representations on small labelled data sets . <br />
<br />
Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. <br />
<br />
Self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method can help aid against generalization issues where the agent cannot distinguish the difference between newly introduced objects.<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many few-shot learning methods currently exist, among which is this paper which focuses on Prototypical Networks or ProtoNets[1] for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML)[2].<br />
<br />
The other machine learning technique that this paper is based on is self-supervised learning. In this technique unlabelled data is utilized which can avoid incurring the computational expenses of labelling and maintaining a massive data set . Images already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.<br />
<br />
The work in this paper is also related to multi-task learning. In multi-task learning training proceeds on multiple tasks concurrently to improve each other. Training on multiple tasks is known to decline the performance on individual tasks[3] and this seems to work only for very specific combinations and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning are mutually beneficial to each other and this has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning.<br />
<br />
In this a feed-forward convolutional network <math>f(x)</math> maps either a labelled image or an augmented unlabelled image to an embedding space. Depending on the input type the embedding is then mapped to one of two label spaces by either a classifier <math>g</math> or a function <math>h</math>. <br />
The labelled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the unlabelled images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. Within this domain augmentations will have be applied to the images. The authors consider the augmentation types of jigsaw puzzle and rotation.They also compare the effects on accuracy of having the unlabelled image be an augmentation of the inputted labelled image (i.e <math>\mathcal{D}_s = \mathcal{D}_{ss}</math>) versus having the unlabelled image be an augmentation of a different image (i.e <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>). <br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Combining supervised and self-supervised losses for few-shot learning. . This paper investigates how the performance on the supervised learning task is influenced by the the choice of the self-supervision task.</div><br />
<br />
The training procedure consists of mapping a labelled image and an unlabelled augmented image to separate embeddings using the shared feature backbone of the feed-forward convolutional network <math>f</math>. It is then trained using an loss function <math>\mathcal{L}</math> which combines a classification loss term <math>\mathcal{L}_s</math> involving the labelled image embedding and a self-supervised losses term <math>\mathcal{L}_{ss}</math> involving the unlabelled augmented image embedding.<br />
<br />
The classification loss <math>\mathcal{L}_s</math> is defined as:<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
Where it is common to use cross-entropy loss for the loss function, <math> \ell </math>, and <math> \ell_2 </math> norm for the regularization, <math> \mathcal{R} </math>.<br />
<br />
The task prediction loss <math>\mathcal{L}_{ss}</math> utilizes a separate function <math>h</math> which maps the embeddings of unlabelled images to a separate label space. Here a target label <math>\hat{y}</math> will be related to the augmentation that was applied to the unlabelled image. In the case of jigsaw the label will be the indexes of the permutations applied to the original image. In the case of a rotation the label will be the angle of rotation applied to the original image. If we define a set of labelled pairs for the previously unlabelled augmented imaged as, <math> \forall x \in \mathcal{D}_{ss}, x \rightarrow (\hat{x_i}, \hat{y_i}) </math>, then the task prediction loss can then be defined as:<br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
<br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that for the case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.<br />
<br />
== Experiments ==<br />
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class as shown in Figure 2. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
[[File:1.png |center|]]<br />
<br />
<div align="center">Figure 2: Used datasets and their base, validation and test splits.</div><br />
<br />
The authors used a meta-learning method based on prototypical networks where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based learners and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle[4]. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.<br />
<br />
== Results ==<br />
The results on 5-way 5-shot classification accuracy can be seen in Fig. 3. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.<br />
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 4, SSL is found to be more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 5 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: Performance on few-shot learning using different meta-learners.</div><br />
<br />
In Fig. 6 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 6(a) shows the changes in the accuracy based on increasing the percentage of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 6(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often better than increasing the size, but shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
<br />
Figure 7 shows the accuracy of the meta-learner with SSL on different domains as function of distance between the supervised domain Ds and the self-supervised domain Dss. Once again we see that the effectiveness of SSL decreases with the distance from the supervised domain across all datasets.<br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 7: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that Self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels.<br />
<br />
== Conclusion ==<br />
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also showed that the dataset used for SSL should not necessarily be large. Increasing the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains, while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
Moreover, in fig. 8 the authors considered the same domain learning for different examples and they indicated that adding more unlabeled data of the base classes will increase the accuracy. I would be really curious to apply their approach using cross-domain learning where the base and novel classes come from very different domains. I believe it might add some robustness and take accuracy to a different level. Also, comparing the cross-domain with the same-domain learning might add value to their point when they clued that there is no much improvement in the rotation task especially in the flowers example as it is mostly symmetrical. <br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 8: Comparison with prior works on mini_ImageNet.</div><br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=44673When Does Self-Supervision Improve Few-Shot Learning?2020-11-15T21:00:31Z<p>J32edwar: /* Method */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper proposes a technique utilizing self-supervised learning (SSL) to improve the generalization of few-shot learned representations on small labelled data sets . <br />
<br />
Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. <br />
<br />
Self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method can help aid against generalization issues where the agent cannot distinguish the difference between newly introduced objects.<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many few-shot learning methods currently exist, among which is this paper which focuses on Prototypical Networks or ProtoNets[1] for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML)[2].<br />
<br />
The other machine learning technique that this paper is based on is self-supervised learning. In this technique unlabelled data is utilized which can avoid incurring the computational expenses of labelling and maintaining a massive data set . Images already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.<br />
<br />
The work in this paper is also related to multi-task learning. In multi-task learning training proceeds on multiple tasks concurrently to improve each other. Training on multiple tasks is known to decline the performance on individual tasks[3] and this seems to work only for very specific combinations and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning are mutually beneficial to each other and this has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning.<br />
<br />
In this a feed-forward convolutional network <math>f(x)</math> maps either a labelled image or an augmented unlabelled image to an embedding space. Depending on the input type the embedding is then mapped to one of two label spaces by either a classifier <math>g</math> or a function <math>h</math>. <br />
The labelled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the unlabelled images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. Within this domain augmentations will have be applied to the images. The authors consider the augmentation types of jigsaw puzzle and rotation.They also compare the effects on accuracy of having the unlabelled image be an augmentation of the inputted labelled image (i.e <math>\mathcal{D}_s = \mathcal{D}_{ss}</math>) versus having the unlabelled image be an augmentation of a different image (i.e <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>). <br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Combining supervised and self-supervised losses for few-shot learning. . This paper investigates how the performance on the supervised learning task is influenced by the the choice of the self-supervision task.</div><br />
<br />
The training procedure consists of mapping a labelled image and an unlabelled augmented image to separate embeddings using the shared feature backbone of the feed-forward convolutional network <math>f</math>. It is then trained using an loss function <math>\mathcal{L}</math> which combines a classification loss term <math>\mathcal{L}_s</math> involving the labelled image embedding and a self-supervised losses term <math>\mathcal{L}_{ss}</math> involving the unlabelled augmented image embedding.<br />
<br />
The classification loss <math>\mathcal{L}_s</math> is defined as:<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
Where it is common to use cross-entropy loss for the loss function, <math> \ell </math>, and <math> \ell_2 </math> norm for the regularization, <math> \mathcal{R} </math>.<br />
<br />
The task prediction loss <math>\mathcal{L}_{ss}</math> utilizes a separate function <math>h</math> which maps the embeddings of unlabelled images to a separate label space. Here a target label <math>\hat{y}</math> will be related to the augmentation that was applied to the unlabelled image <math>\hat{x}</math> . In the case of jigsaw the label will be the indexes of the permutations applied to the original image. In the case of a rotation the label will be the angle of rotation applied to the original image. If we define a set of labelled pairs for the previously unlabelled augmented imaged as, <math> \forall x \in \mathcal{D}_{ss}, x \rightarrow (\hat{x_i}, \hat{y_i}) </math>, then the task prediction loss can then be defined as:<br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
<br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that for the case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.<br />
<br />
== Experiments ==<br />
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class as shown in Figure 2. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
[[File:1.png |center|]]<br />
<br />
<div align="center">Figure 2: Used datasets and their base, validation and test splits.</div><br />
<br />
The authors used a meta-learning method based on prototypical networks where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based learners and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle[4]. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.<br />
<br />
== Results ==<br />
The results on 5-way 5-shot classification accuracy can be seen in Fig. 3. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.<br />
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 4, SSL is found to be more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 5 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: Performance on few-shot learning using different meta-learners.</div><br />
<br />
In Fig. 6 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 6(a) shows the changes in the accuracy based on increasing the percentage of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 6(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often better than increasing the size, but shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
<br />
Figure 7 shows the accuracy of the meta-learner with SSL on different domains as function of distance between the supervised domain Ds and the self-supervised domain Dss. Once again we see that the effectiveness of SSL decreases with the distance from the supervised domain across all datasets.<br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 7: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that Self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels.<br />
<br />
== Conclusion ==<br />
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also showed that the dataset used for SSL should not necessarily be large. Increasing the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains, while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
Moreover, in fig. 8 the authors considered the same domain learning for different examples and they indicated that adding more unlabeled data of the base classes will increase the accuracy. I would be really curious to apply their approach using cross-domain learning where the base and novel classes come from very different domains. I believe it might add some robustness and take accuracy to a different level. Also, comparing the cross-domain with the same-domain learning might add value to their point when they clued that there is no much improvement in the rotation task especially in the flowers example as it is mostly symmetrical. <br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 8: Comparison with prior works on mini_ImageNet.</div><br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=44672When Does Self-Supervision Improve Few-Shot Learning?2020-11-15T20:59:59Z<p>J32edwar: /* Method */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper proposes a technique utilizing self-supervised learning (SSL) to improve the generalization of few-shot learned representations on small labelled data sets . <br />
<br />
Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. <br />
<br />
Self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method can help aid against generalization issues where the agent cannot distinguish the difference between newly introduced objects.<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many few-shot learning methods currently exist, among which is this paper which focuses on Prototypical Networks or ProtoNets[1] for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML)[2].<br />
<br />
The other machine learning technique that this paper is based on is self-supervised learning. In this technique unlabelled data is utilized which can avoid incurring the computational expenses of labelling and maintaining a massive data set . Images already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.<br />
<br />
The work in this paper is also related to multi-task learning. In multi-task learning training proceeds on multiple tasks concurrently to improve each other. Training on multiple tasks is known to decline the performance on individual tasks[3] and this seems to work only for very specific combinations and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning are mutually beneficial to each other and this has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning.<br />
<br />
In this a feed-forward convolutional network <math>f(x)</math> maps either a labelled image or an augmented unlabelled image to an embedding space. Depending on the input type the embedding is then mapped to one of two label spaces by either a classifier <math>g</math> or a function <math>h</math>. <br />
The labelled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the unlabelled images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. Within this domain augmentations will have be applied to the images. The authors consider the augmentation types of jigsaw puzzle and rotation.They also compare the effects on accuracy of having the unlabelled image be an augmentation of the inputted labelled image (i.e <math>\mathcal{D}_s = \mathcal{D}_{ss}</math>) versus having the unlabelled image be an augmentation of a different image (i.e <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>). <br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Combining supervised and self-supervised losses for few-shot learning. . This paper investigates how the performance on the supervised learning task is influenced by the the choice of the self-supervision task.</div><br />
<br />
The training procedure consists of mapping a labelled image and an unlabelled augmented image to separate embeddings using the shared feature backbone of the feed-forward convolutional network <math>f(x)</math>. It is then trained using an loss function <math>\mathcal{L}</math> which combines a classification loss term <math>\mathcal{L}_s</math> involving the labelled image embedding and a self-supervised losses term <math>\mathcal{L}_{ss}</math> involving the unlabelled augmented image embedding.<br />
<br />
The classification loss <math>\mathcal{L}_s</math> is defined as:<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
Where it is common to use cross-entropy loss for the loss function, <math> \ell </math>, and <math> \ell_2 </math> norm for the regularization, <math> \mathcal{R} </math>.<br />
<br />
The task prediction loss <math>\mathcal{L}_{ss}</math> utilizes a separate function <math>h</math> which maps the embeddings of unlabelled images to a separate label space. Here a target label <math>\hat{y}</math> will be related to the augmentation that was applied to the unlabelled image <math>\hat{x}</math> . In the case of jigsaw the label will be the indexes of the permutations applied to the original image. In the case of a rotation the label will be the angle of rotation applied to the original image. If we define a set of labelled pairs for the previously unlabelled augmented imaged as, <math> \forall x \in \mathcal{D}_{ss}, x \rightarrow (\hat{x_i}, \hat{y_i}) </math>, then the task prediction loss can then be defined as:<br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
<br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that for the case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.<br />
<br />
== Experiments ==<br />
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class as shown in Figure 2. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
[[File:1.png |center|]]<br />
<br />
<div align="center">Figure 2: Used datasets and their base, validation and test splits.</div><br />
<br />
The authors used a meta-learning method based on prototypical networks where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based learners and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle[4]. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.<br />
<br />
== Results ==<br />
The results on 5-way 5-shot classification accuracy can be seen in Fig. 3. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.<br />
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 4, SSL is found to be more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 5 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: Performance on few-shot learning using different meta-learners.</div><br />
<br />
In Fig. 6 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 6(a) shows the changes in the accuracy based on increasing the percentage of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 6(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often better than increasing the size, but shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
<br />
Figure 7 shows the accuracy of the meta-learner with SSL on different domains as function of distance between the supervised domain Ds and the self-supervised domain Dss. Once again we see that the effectiveness of SSL decreases with the distance from the supervised domain across all datasets.<br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 7: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that Self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels.<br />
<br />
== Conclusion ==<br />
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also showed that the dataset used for SSL should not necessarily be large. Increasing the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains, while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
Moreover, in fig. 8 the authors considered the same domain learning for different examples and they indicated that adding more unlabeled data of the base classes will increase the accuracy. I would be really curious to apply their approach using cross-domain learning where the base and novel classes come from very different domains. I believe it might add some robustness and take accuracy to a different level. Also, comparing the cross-domain with the same-domain learning might add value to their point when they clued that there is no much improvement in the rotation task especially in the flowers example as it is mostly symmetrical. <br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 8: Comparison with prior works on mini_ImageNet.</div><br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=44671When Does Self-Supervision Improve Few-Shot Learning?2020-11-15T20:59:11Z<p>J32edwar: /* Method */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper proposes a technique utilizing self-supervised learning (SSL) to improve the generalization of few-shot learned representations on small labelled data sets . <br />
<br />
Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. <br />
<br />
Self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method can help aid against generalization issues where the agent cannot distinguish the difference between newly introduced objects.<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many few-shot learning methods currently exist, among which is this paper which focuses on Prototypical Networks or ProtoNets[1] for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML)[2].<br />
<br />
The other machine learning technique that this paper is based on is self-supervised learning. In this technique unlabelled data is utilized which can avoid incurring the computational expenses of labelling and maintaining a massive data set . Images already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.<br />
<br />
The work in this paper is also related to multi-task learning. In multi-task learning training proceeds on multiple tasks concurrently to improve each other. Training on multiple tasks is known to decline the performance on individual tasks[3] and this seems to work only for very specific combinations and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning are mutually beneficial to each other and this has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning.<br />
<br />
In this a feed-forward convolutional network <math>f(x)</math> maps either a labelled image or an augmented unlabelled image to an embedding space. Depending on the input type the embedding is then mapped to one of two label spaces by either a classifier <math>g</math> or a function <math>h</math>. <br />
The labelled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the unlabelled images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. Within this domain augmentations will have be applied to the images. The authors consider the augmentation types of jigsaw puzzle and rotation.They also compare the effects on accuracy of having the unlabelled image be an augmentation of the inputted labelled image (i.e <math>\mathcal{D}_s = \mathcal{D}_{ss}</math>) versus having the unlabelled image be an augmentation of a different image (i.e <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>). <br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Combining supervised and self-supervised losses for few-shot learning. . This paper investigates how the performance on the supervised learning task is influenced by the the choice of the self-supervision task.</div><br />
<br />
The training procedure consists of mapping a labelled image and unlabelled augmented image to separate embeddings using the shared feature backbone of the feed-forward convolutional network <math>f(x)</math>. It is then trained using an loss function <math>\mathcal{L}</math> which combines a classification loss term <math>\mathcal{L}_s</math> involving the labelled image embedding and a self-supervised losses term <math>\mathcal{L}_{ss}</math> involving the unlabelled augmented image embedding.<br />
<br />
The classification loss <math>\mathcal{L}_s</math> is defined as:<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
Where it is common to use cross-entropy loss for the loss function, <math> \ell </math>, and <math> \ell_2 </math> norm for the regularization, <math> \mathcal{R} </math>.<br />
<br />
The task prediction loss <math>\mathcal{L}_{ss}</math> utilizes a separate function <math>h</math> which maps the embeddings of unlabelled images to a separate label space. Here a target label <math>\hat{y}</math> will be related to the augmentation that was applied to the unlabelled image <math>\hat{x}</math> . In the case of jigsaw the label will be the indexes of the permutations applied to the original image. In the case of a rotation the label will be the angle of rotation applied to the original image. If we define a set of labelled pairs for the previously unlabelled augmented imaged as, <math> \forall x \in \mathcal{D}_{ss}, x \rightarrow (\hat{x_i}, \hat{y_i}) </math>, then the task prediction loss can then be defined as:<br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
<br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that for the case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.<br />
<br />
== Experiments ==<br />
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class as shown in Figure 2. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
[[File:1.png |center|]]<br />
<br />
<div align="center">Figure 2: Used datasets and their base, validation and test splits.</div><br />
<br />
The authors used a meta-learning method based on prototypical networks where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based learners and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle[4]. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.<br />
<br />
== Results ==<br />
The results on 5-way 5-shot classification accuracy can be seen in Fig. 3. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.<br />
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 4, SSL is found to be more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 5 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: Performance on few-shot learning using different meta-learners.</div><br />
<br />
In Fig. 6 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 6(a) shows the changes in the accuracy based on increasing the percentage of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 6(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often better than increasing the size, but shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
<br />
Figure 7 shows the accuracy of the meta-learner with SSL on different domains as function of distance between the supervised domain Ds and the self-supervised domain Dss. Once again we see that the effectiveness of SSL decreases with the distance from the supervised domain across all datasets.<br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 7: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that Self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels.<br />
<br />
== Conclusion ==<br />
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also showed that the dataset used for SSL should not necessarily be large. Increasing the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains, while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
Moreover, in fig. 8 the authors considered the same domain learning for different examples and they indicated that adding more unlabeled data of the base classes will increase the accuracy. I would be really curious to apply their approach using cross-domain learning where the base and novel classes come from very different domains. I believe it might add some robustness and take accuracy to a different level. Also, comparing the cross-domain with the same-domain learning might add value to their point when they clued that there is no much improvement in the rotation task especially in the flowers example as it is mostly symmetrical. <br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 8: Comparison with prior works on mini_ImageNet.</div><br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=44670When Does Self-Supervision Improve Few-Shot Learning?2020-11-15T20:58:42Z<p>J32edwar: /* Method */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper proposes a technique utilizing self-supervised learning (SSL) to improve the generalization of few-shot learned representations on small labelled data sets . <br />
<br />
Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. <br />
<br />
Self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method can help aid against generalization issues where the agent cannot distinguish the difference between newly introduced objects.<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many few-shot learning methods currently exist, among which is this paper which focuses on Prototypical Networks or ProtoNets[1] for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML)[2].<br />
<br />
The other machine learning technique that this paper is based on is self-supervised learning. In this technique unlabelled data is utilized which can avoid incurring the computational expenses of labelling and maintaining a massive data set . Images already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.<br />
<br />
The work in this paper is also related to multi-task learning. In multi-task learning training proceeds on multiple tasks concurrently to improve each other. Training on multiple tasks is known to decline the performance on individual tasks[3] and this seems to work only for very specific combinations and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning are mutually beneficial to each other and this has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning.<br />
<br />
In this a feed-forward convolutional network <math>f(x)</math> maps either a labelled image or an augmented unlabelled image to an embedding space. Depending on the input type the embedding is then mapped to one of two label spaces by either a classifier <math>g</math> or a function <math>h</math>. <br />
The labelled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the unlabelled images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. Within this domain augmentations will have be applied to the images. The authors consider the augmentation types of jigsaw puzzle and rotation.They also compare the effects on accuracy of having the the unlabelled image be an augmentation of the inputted labelled image (i.e <math>\mathcal{D}_s = \mathcal{D}_{ss}</math>) versus having the unlabelled image be an augmentation of a different image (i.e <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>). <br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Combining supervised and self-supervised losses for few-shot learning. . This paper investigates how the performance on the supervised learning task is influenced by the the choice of the self-supervision task.</div><br />
<br />
The training procedure consists of mapping a labelled image and unlabelled augmented image to separate embeddings using the shared feature backbone of the feed-forward convolutional network <math>f(x)</math>. It is then trained using an loss function <math>\mathcal{L}</math> which combines a classification loss term <math>\mathcal{L}_s</math> involving the labelled image embedding and a self-supervised losses term <math>\mathcal{L}_{ss}</math> involving the unlabelled augmented image embedding.<br />
<br />
The classification loss <math>\mathcal{L}_s</math> is defined as:<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
Where it is common to use cross-entropy loss for the loss function, <math> \ell </math>, and <math> \ell_2 </math> norm for the regularization, <math> \mathcal{R} </math>.<br />
<br />
The task prediction loss <math>\mathcal{L}_{ss}</math> utilizes a separate function <math>h</math> which maps the embeddings of unlabelled images to a separate label space. Here a target label <math>\hat{y}</math> will be related to the augmentation that was applied to the unlabelled image <math>\hat{x}</math> . In the case of jigsaw the label will be the indexes of the permutations applied to the original image. In the case of a rotation the label will be the angle of rotation applied to the original image. If we define a set of labelled pairs for the previously unlabelled augmented imaged as, <math> \forall x \in \mathcal{D}_{ss}, x \rightarrow (\hat{x_i}, \hat{y_i}) </math>, then the task prediction loss can then be defined as:<br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
<br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that for the case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.<br />
<br />
== Experiments ==<br />
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class as shown in Figure 2. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
[[File:1.png |center|]]<br />
<br />
<div align="center">Figure 2: Used datasets and their base, validation and test splits.</div><br />
<br />
The authors used a meta-learning method based on prototypical networks where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based learners and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle[4]. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.<br />
<br />
== Results ==<br />
The results on 5-way 5-shot classification accuracy can be seen in Fig. 3. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.<br />
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 4, SSL is found to be more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 5 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: Performance on few-shot learning using different meta-learners.</div><br />
<br />
In Fig. 6 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 6(a) shows the changes in the accuracy based on increasing the percentage of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 6(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often better than increasing the size, but shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
<br />
Figure 7 shows the accuracy of the meta-learner with SSL on different domains as function of distance between the supervised domain Ds and the self-supervised domain Dss. Once again we see that the effectiveness of SSL decreases with the distance from the supervised domain across all datasets.<br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 7: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that Self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels.<br />
<br />
== Conclusion ==<br />
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also showed that the dataset used for SSL should not necessarily be large. Increasing the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains, while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
Moreover, in fig. 8 the authors considered the same domain learning for different examples and they indicated that adding more unlabeled data of the base classes will increase the accuracy. I would be really curious to apply their approach using cross-domain learning where the base and novel classes come from very different domains. I believe it might add some robustness and take accuracy to a different level. Also, comparing the cross-domain with the same-domain learning might add value to their point when they clued that there is no much improvement in the rotation task especially in the flowers example as it is mostly symmetrical. <br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 8: Comparison with prior works on mini_ImageNet.</div><br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=44669When Does Self-Supervision Improve Few-Shot Learning?2020-11-15T20:56:56Z<p>J32edwar: /* Method */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper proposes a technique utilizing self-supervised learning (SSL) to improve the generalization of few-shot learned representations on small labelled data sets . <br />
<br />
Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. <br />
<br />
Self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method can help aid against generalization issues where the agent cannot distinguish the difference between newly introduced objects.<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many few-shot learning methods currently exist, among which is this paper which focuses on Prototypical Networks or ProtoNets[1] for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML)[2].<br />
<br />
The other machine learning technique that this paper is based on is self-supervised learning. In this technique unlabelled data is utilized which can avoid incurring the computational expenses of labelling and maintaining a massive data set . Images already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.<br />
<br />
The work in this paper is also related to multi-task learning. In multi-task learning training proceeds on multiple tasks concurrently to improve each other. Training on multiple tasks is known to decline the performance on individual tasks[3] and this seems to work only for very specific combinations and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning are mutually beneficial to each other and this has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning.<br />
<br />
In this a feed-forward convolutional network <math>f(x)</math> maps either a labelled image or an augmented unlabelled image to an embedding space. Depending on the input type the embedding is then mapped to one of two label spaces by either a classifier <math>g</math> or a function <math>h</math>. <br />
The labelled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the unlabelled images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. Within this domain augmentations will be applied to the images . The authors consider the augmentations types of jigsaw puzzle and rotation.They also compare the effects on accuracy of having the the unlabelled image be an augmentation of the inputted labelled image (i.e <math>\mathcal{D}_s = \mathcal{D}_{ss}</math>) versus having the unlabelled image be an augmentation of a different image (i.e <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>). <br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Combining supervised and self-supervised losses for few-shot learning. . This paper investigates how the performance on the supervised learning task is influenced by the the choice of the self-supervision task.</div><br />
<br />
The training procedure consists of mapping a labelled image and unlabelled augmented image to separate embeddings using the shared feature backbone of the feed-forward convolutional network <math>f(x)</math>. It is then trained using an loss function <math>\mathcal{L}</math> which combines a classification loss term <math>\mathcal{L}_s</math> involving the labelled image embedding and a self-supervised losses term <math>\mathcal{L}_{ss}</math> involving the unlabelled augmented image embedding.<br />
<br />
The classification loss <math>\mathcal{L}_s</math> is defined as:<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
Where it is common to use cross-entropy loss for the loss function, <math> \ell </math>, and <math> \ell_2 </math> norm for the regularization, <math> \mathcal{R} </math>.<br />
<br />
The task prediction loss <math>\mathcal{L}_{ss}</math> utilizes a separate function <math>h</math> which maps the embeddings of unlabelled images to a separate label space. Here a target label <math>\hat{y}</math> will be related to the augmentation that was applied to the unlabelled image <math>\hat{x}</math> . In the case of jigsaw the label will be the indexes of the permutations applied to the original image. In the case of a rotation the label will be the angle of rotation applied to the original image. If we define a set of labelled pairs for the previously unlabelled augmented imaged as, <math> \forall x \in \mathcal{D}_{ss}, x \rightarrow (\hat{x_i}, \hat{y_i}) </math>, then the task prediction loss can then be defined as:<br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
<br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that for the case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.<br />
<br />
== Experiments ==<br />
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class as shown in Figure 2. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
[[File:1.png |center|]]<br />
<br />
<div align="center">Figure 2: Used datasets and their base, validation and test splits.</div><br />
<br />
The authors used a meta-learning method based on prototypical networks where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based learners and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle[4]. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.<br />
<br />
== Results ==<br />
The results on 5-way 5-shot classification accuracy can be seen in Fig. 3. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.<br />
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 4, SSL is found to be more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 5 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: Performance on few-shot learning using different meta-learners.</div><br />
<br />
In Fig. 6 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 6(a) shows the changes in the accuracy based on increasing the percentage of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 6(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often better than increasing the size, but shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
<br />
Figure 7 shows the accuracy of the meta-learner with SSL on different domains as function of distance between the supervised domain Ds and the self-supervised domain Dss. Once again we see that the effectiveness of SSL decreases with the distance from the supervised domain across all datasets.<br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 7: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that Self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels.<br />
<br />
== Conclusion ==<br />
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also showed that the dataset used for SSL should not necessarily be large. Increasing the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains, while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
Moreover, in fig. 8 the authors considered the same domain learning for different examples and they indicated that adding more unlabeled data of the base classes will increase the accuracy. I would be really curious to apply their approach using cross-domain learning where the base and novel classes come from very different domains. I believe it might add some robustness and take accuracy to a different level. Also, comparing the cross-domain with the same-domain learning might add value to their point when they clued that there is no much improvement in the rotation task especially in the flowers example as it is mostly symmetrical. <br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 8: Comparison with prior works on mini_ImageNet.</div><br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=44666When Does Self-Supervision Improve Few-Shot Learning?2020-11-15T20:33:58Z<p>J32edwar: /* Method */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper proposes a technique utilizing self-supervised learning (SSL) to improve the generalization of few-shot learned representations on small labelled data sets . <br />
<br />
Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. <br />
<br />
Self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method can help aid against generalization issues where the agent cannot distinguish the difference between newly introduced objects.<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many few-shot learning methods currently exist, among which is this paper which focuses on Prototypical Networks or ProtoNets[1] for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML)[2].<br />
<br />
The other machine learning technique that this paper is based on is self-supervised learning. In this technique unlabelled data is utilized which can avoid incurring the computational expenses of labelling and maintaining a massive data set . Images already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.<br />
<br />
The work in this paper is also related to multi-task learning. In multi-task learning training proceeds on multiple tasks concurrently to improve each other. Training on multiple tasks is known to decline the performance on individual tasks[3] and this seems to work only for very specific combinations and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning are mutually beneficial to each other and this has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning.<br />
<br />
In this a feed-forward convolutional network <math>f(x)</math> maps either a labelled image or an augmented unlabelled image to an embedding space. Depending on the input type the embedding is then mapped to one of two label spaces by either a classifier <math>g</math> or a function <math>h</math>. <br />
The labelled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the augmented images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. Within this domain a single type of augmentation will be applied to the images . The authors consider the augmentations of jigsaw puzzle and rotation.The authors also compare the effects on accuracy of having the the unlabelled image be an augmentation of the inputted labelled image (i.e <math>\mathcal{D}_s = \mathcal{D}_{ss}</math>) versus having the unlabelled image be an augmentation of a different image (i.e <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>). <br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Combining supervised and self-supervised losses for few-shot learning. . This paper investigates how the performance on the supervised learning task is influenced by the the choice of the self-supervision task.</div><br />
<br />
The feed-forward convolutional network <math>f(x)</math> is trained using an loss function <math>\mathcal{L}</math> which combines a classification loss term <math>\mathcal{L}_s</math> and a self-supervised losses term <math> \mathcal{L}_{ss}</math>.<br />
<br />
The classification loss <math>\mathcal{L}_s</math> is defined as:<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
Where it is common to use cross-entropy loss for the loss function, <math> \ell </math>, and <math> \ell_2 </math> norm for the regularization, <math> \mathcal{R} </math>.<br />
<br />
<br />
The task prediction loss <math>\mathcal{L}_{ss}</math> utilizes a separate function <math>h</math> which maps the embeddings of unlabelled images to a separate label space. Here a target label <math>\hat{y}</math> will be related to the augmentation that was applied to the unlabelled image <math>\hat{x}</math> . In the case of jigsaw the label will be the indexes of the permutations applied to the original image. In the case of a rotation the label will be the angle of rotation applied to the original image. If we define a set of labelled pairs for the previously unlabelled augmented imaged as, <math> \forall x \in \mathcal{D}_{ss}, x \rightarrow (\hat{x_i}, \hat{y_i}) </math>, then the task prediction loss can then be defined as:<br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
<br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.<br />
<br />
== Experiments ==<br />
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class as shown in Figure 2. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
[[File:1.png |center|]]<br />
<br />
<div align="center">Figure 2: Used datasets and their base, validation and test splits.</div><br />
<br />
The authors used a meta-learning method based on prototypical networks where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based learners and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle[4]. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.<br />
<br />
== Results ==<br />
The results on 5-way 5-shot classification accuracy can be seen in Fig. 3. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.<br />
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 4, SSL is found to be more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 5 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: Performance on few-shot learning using different meta-learners.</div><br />
<br />
In Fig. 6 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 6(a) shows the changes in the accuracy based on increasing the percentage of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 6(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often better than increasing the size, but shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
<br />
Figure 7 shows the accuracy of the meta-learner with SSL on different domains as function of distance between the supervised domain Ds and the self-supervised domain Dss. Once again we see that the effectiveness of SSL decreases with the distance from the supervised domain across all datasets.<br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 7: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that Self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels.<br />
<br />
== Conclusion ==<br />
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also showed that the dataset used for SSL should not necessarily be large. Increasing the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains, while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
Moreover, in fig. 8 the authors considered the same domain learning for different examples and they indicated that adding more unlabeled data of the base classes will increase the accuracy. I would be really curious to apply their approach using cross-domain learning where the base and novel classes come from very different domains. I believe it might add some robustness and take accuracy to a different level. Also, comparing the cross-domain with the same-domain learning might add value to their point when they clued that there is no much improvement in the rotation task especially in the flowers example as it is mostly symmetrical. <br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 8: Comparison with prior works on mini_ImageNet.</div><br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=44626When Does Self-Supervision Improve Few-Shot Learning?2020-11-15T18:15:23Z<p>J32edwar: /* Introduction */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper proposes a technique utilizing self-supervised learning (SSL) to improve the generalization of few-shot learned representations on small labelled data sets . <br />
<br />
Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. <br />
<br />
Self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method can help aid against generalization issues where the agent cannot distinguish the difference between newly introduced objects.<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many few-shot learning methods currently exist, among which is this paper which focuses on Prototypical Networks or ProtoNets[1] for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML)[2].<br />
<br />
The other machine learning technique that this paper is based on is self-supervised learning. In this technique unlabelled data is utilized which can avoid incurring the computational expenses of labelling and maintaining a massive data set . Images already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.<br />
<br />
The work in this paper is also related to multi-task learning. In multi-task learning training proceeds on multiple tasks concurrently to improve each other. Training on multiple tasks is known to decline the performance on individual tasks[3] and this seems to work only for very specific combinations and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning are mutually beneficial to each other and this has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.<br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Combining supervised and self-supervised losses for few-shot learning. Self-supervised tasks used are jigsaw puzzle and rotation. This paper investigates how the performance on the supervised learning task is influenced by the the choice of the self-supervision task.</div><br />
<br />
The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:<br />
<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
It is common to use cross-entropy loss for the loss function, <math> \ell </math>, and <math> \ell_2 </math> norm for the regularization, <math> \mathcal{R} </math>, in the above equations. <br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.<br />
<br />
== Experiments ==<br />
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class as shown in Figure 2. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
[[File:1.png |center|]]<br />
<br />
<div align="center">Figure 2: Used datasets and their base, validation and test splits.</div><br />
<br />
The authors used a meta-learning method based on prototypical networks where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based learners and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle[4]. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.<br />
<br />
== Results ==<br />
The results on 5-way 5-shot classification accuracy can be seen in Fig. 3. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.<br />
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 4, SSL is found to be more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 5 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: Performance on few-shot learning using different meta-learners.</div><br />
<br />
In Fig. 6 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 6(a) shows the changes in the accuracy based on increasing the percentage of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 6(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often better than increasing the size, but shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
<br />
Figure 7 shows the accuracy of the meta-learner with SSL on different domains as function of distance between the supervised domain Ds and the self-supervised domain Dss. Once again we see that the effectiveness of SSL decreases with the distance from the supervised domain across all datasets.<br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 7: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that Self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels.<br />
<br />
== Conclusion ==<br />
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also showed that the dataset used for SSL should not necessarily be large. Increasing the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains, while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
Moreover, in fig. 8 the authors considered the same domain learning for different examples and they indicated that adding more unlabeled data of the base classes will increase the accuracy. I would be really curious to apply their approach using cross-domain learning where the base and novel classes come from very different domains. I believe it might add some robustness and take accuracy to a different level. Also, comparing the cross-domain with the same-domain learning might add value to their point when they clued that there is no much improvement in the rotation task especially in the flowers example as it is mostly symmetrical. <br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 8: Comparison with prior works on mini_ImageNet.</div><br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=44601When Does Self-Supervision Improve Few-Shot Learning?2020-11-15T17:27:20Z<p>J32edwar: /* Previous Work */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many few-shot learning methods currently exist, among which is this paper which focuses on Prototypical Networks or ProtoNets[1] for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML)[2].<br />
<br />
The other machine learning technique that this paper is based on is self-supervised learning. In this technique unlabelled data is utilized which can avoid incurring the computational expenses of labelling and maintaining a massive data set . Images already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.<br />
<br />
The work in this paper is also related to multi-task learning. In multi-task learning training proceeds on multiple tasks concurrently to improve each other. Training on multiple tasks is known to decline the performance on individual tasks[3] and this seems to work only for very specific combinations and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning are mutually beneficial to each other and this has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.<br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Combining supervised and self-supervised losses for few-shot learning. Self-supervised tasks used are jigsaw puzzle and rotation. This paper investigates how the performance on the supervised learning task is influenced by the the choice of the self-supervision task.</div><br />
<br />
The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:<br />
<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
It is common to use cross-entropy loss for the loss function, <math> \ell </math>, and <math> \ell_2 </math> norm for the regularization, <math> \mathcal{R} </math>, in the above equations. <br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.<br />
<br />
== Experiments ==<br />
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class as shown in Figure 2. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
[[File:1.png |center|]]<br />
<br />
<div align="center">Figure 2: Used datasets and their base, validation and test splits.</div><br />
<br />
The authors used a meta-learning method based on prototypical networks where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based learners and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle[4]. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.<br />
<br />
== Results ==<br />
The results on 5-way 5-shot classification accuracy can be seen in Fig. 3. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.<br />
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 4, SSL is found to be more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 5 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: Performance on few-shot learning using different meta-learners.</div><br />
<br />
In Fig. 6 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 6(a) shows the changes in the accuracy based on increasing the percentage of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 6(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often better than increasing the size, but shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
<br />
Figure 7 shows the accuracy of the meta-learner with SSL on different domains as function of distance between the supervised domain Ds and the self-supervised domain Dss. Once again we see that the effectiveness of SSL decreases with the distance from the supervised domain across all datasets.<br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 7: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that Self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels.<br />
<br />
== Conclusion ==<br />
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also showed that the dataset used for SSL should not necessarily be large. Increasing the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains, while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
Moreover, in fig. 8 the authors considered the same domain learning for different examples and they indicated that adding more unlabeled data of the base classes will increase the accuracy. I would be really curious to apply their approach using cross-domain learning where the base and novel classes come from very different domains. I believe it might add some robustness and take accuracy to a different level. Also, comparing the cross-domain with the same-domain learning might add value to their point when they clued that there is no much improvement in the rotation task especially in the flowers example as it is mostly symmetrical. <br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 8: Comparison with prior works on mini_ImageNet.</div><br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration&diff=44269The Curious Case of Degeneration2020-11-15T00:43:59Z<p>J32edwar: /* Nucleus Sampling */</p>
<hr />
<div>== Presented by == <br />
Donya Hamzeian<br />
== Introduction == <br />
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results. <br />
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 1: Text generation examples</div><br />
<br />
As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements. <br />
<br />
The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts<br />
<br />
Others have questioned whether a problem with beam search is that by expanding on only the top k tokens in each step of the generation, in later steps it may miss possible sequence that would have resulted in a more probable overall phrase. The authors argue that this isn't an issue for generating natural language as it has lower per-token probability on average and people usually optimize against saying the obvious.<br />
<br />
The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have high variance in likelihood leading to incoherency issues. <br />
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.<br />
<br />
<br />
===The problem with a fixed k===<br />
<br />
In the figure below, it can be seen why having a fixed k in the top-k sampling decoding strategy can lead to degenerative results, more specifically, incoherent and low diversity texts. For instance, in the left figure, the distribution of the next token is flat i.e. there are many tokens with nearly equal probability to be the next token. In this case, if we choose a small k, like 5, some tokens like "meant" and "want" may not appear in the generated text which makes it less diverse. On the other hand, in the right figure, the distribution of the next token is peaked, i.e there are very few words with very high probability. In this case, if we choose k to be large, like 10, we may end up choosing tokens like "going" and "n't" which makes the generated text incoherent. Therefore, it seems that having a fixed-k may lead to degeneration<br />
<br />
<br />
[[File: fixed-k.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 2: Flat versus peaked distribution of tokens</div><br />
<br />
==Language Model Decoding==<br />
There are two types of generation tasks. <br />
<br />
1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Due to this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.<br />
<br />
2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks and the focus of this paper.<br />
<br />
The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.<br />
<br />
\begin{align}<br />
P(x_{1:m+n})=\prod_{i=1}^{m+n}P(x_i|x_1 \ldots x_{i-1})<br />
\end{align}<br />
<br />
<br />
<br />
====Nucleus Sampling====<br />
The authors propose Nucleus Sampling as a stochastic decoding method where the shape of the the probability distribution determines the set of vocabulary tokens to be sampled.<br />
In this they first find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. They then normalise the subset <math>V^{(p)}</math> into a probability distribution by dividing its elements by <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. These normalized probabilities will then be used for the generation of word samples. This entire process can be viewed as a re-scaling to the original probability distribution in to a new distbition <math>P'</math>. Where: <br />
<br />
\begin{align}<br />
P'(x|x_{1:i-1}) = \begin{cases}\frac{P(x|x_{1:i-1})}{p'}, & x \in V^{(p)} \\ 0, & otherwise \end{cases}<br />
\end{align}<br />
<br />
This decoding strategy is beneficial as it can truncate possible long tails of the original probability distribution. Thus is can then help avoid the associated problem of incoherent samples for phrases generated by long-tailed distributions as previously discussed.<br />
<br />
====Top-k Sampling====<br />
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size <math>k</math>, <math>V^{(k)} </math>, which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling. <br />
<br />
Intuitively, the difference between Top-k sampling and Nucleus sampling is how they set a threshold of truncation - the former one defines a threshold at which the tail of the probability distribution gets truncated, whereas the latter puts a cap the number of tokens in the vocabulary set. It is noteworthy that thresholding the number of tokens can cause <math>p'</math> to fluctuate greatly at different time steps.<br />
<br />
====Sampling with Temperature====<br />
In this method, which was proposed in [1], the probability of tokens are calculated according to the equation below where <math>t \in (0,1)</math> is the temperature and <math>u_{1:|V|} </math> are logits. <br />
<br />
<math><br />
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}<br />
</math><br />
<br />
Recent studies have shown that lowering <math>t</math> improves the quality of the generated texts while it decreases diversity. Note that the temperature <math>t</math> controls how conservative the model is, and this analogy comes from thermodynamics, where lower temperature means lower energy states are unlikely to be encountered. Hence, the lower the temperature, the less likely the model is to sample tokens with lower probability.<br />
<br />
==Likelihood Evaluation==<br />
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.<br />
<br />
<br />
====Perplexity====<br />
<br />
This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts. <br />
<br />
[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 3: Comparison of perplexity across decoding strategies</div><br />
<br />
====What is Perplexity?====<br />
<br />
Perplexity as previously mentioned is a score that comes from information theory [3]. It is a measure of how well a probabilistic model or distribution predicts a sample. This intuitively leads to it being useful for comparing how competition models explain the same sample or dataset. Perplexity has close ties to information entropy as can be seen in the following discrete formulation of perplexity for a probability distribution.<br />
<br />
:<math>PP(p) := 2^{H(p)}=2^{-\sum_x p(x)\log_2 p(x)}</math><br />
<br />
Here <math>H(p)</math> is the entropy in bits and <math>p(x)</math> is the probability of observing <math>x</math> from the distribution.<br />
<br />
Perplexity in the context of probability models also has close ties to information entropy. The idea here is a model <math>f(x)</math> is fit to data from an unknown probability distribution <math>p(x)</math>. When the model is given test samples which were not used during its construction; the model will assign these samples some probability <math>f(x_i)</math>. Here <math>x_i</math> comes from a test set where <math>i = 1,...,N</math>. The perplexity will be lowest for a model which has high probabilities for the test samples. This can be seen in the following equation:<br />
<br />
:<math>PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)}</math><br />
<br />
Here <math>b</math> is the base and can be any number though commonly 2 is used to represent bits.<br />
<br />
==Distributional Statistical Evaluation==<br />
====Zipf Distribution Analysis====<br />
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.<br />
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 4: Zipf Distribution Analysis</div><br />
<br />
<br />
====Self BLEU====<br />
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling. <br />
<br />
[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 5: Comparison of Self-BLEU for decoding strategies</div><br />
<br />
==Conclusion==<br />
In this paper, different decoding strategies were analyzed on open-ended generation tasks. They showed that likelihood maximization decoding causes degeneration where decoding strategies which rely on truncating the probability distribution of tokens especially Nucleus sampling can produce coherent and diverse texts close to human-generated texts.<br />
<br />
== References ==<br />
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.<br />
<br />
[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018<br />
<br />
[3]: Perplexity: https://en.wikipedia.org/wiki/Perplexity</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration&diff=44257The Curious Case of Degeneration2020-11-15T00:06:51Z<p>J32edwar: /* Language Model Decoding */</p>
<hr />
<div>== Presented by == <br />
Donya Hamzeian<br />
== Introduction == <br />
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results. <br />
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 1: Text generation examples</div><br />
<br />
As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements. <br />
<br />
The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts<br />
<br />
Others have questioned whether a problem with beam search is that by expanding on only the top k tokens in each step of the generation, in later steps it may miss possible sequence that would have resulted in a more probable overall phrase. The authors argue that this isn't an issue for generating natural language as it has lower per-token probability on average and people usually optimize against saying the obvious.<br />
<br />
The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have high variance in likelihood leading to incoherency issues. <br />
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.<br />
<br />
<br />
===The problem with a fixed k===<br />
<br />
In the figure below, it can be seen why having a fixed k in the top-k sampling decoding strategy can lead to degenerative results, more specifically, incoherent and low diversity texts. For instance, in the left figure, the distribution of the next token is flat i.e. there are many tokens with nearly equal probability to be the next token. In this case, if we choose a small k, like 5, some tokens like "meant" and "want" may not appear in the generated text which makes it less diverse. On the other hand, in the right figure, the distribution of the next token is peaked, i.e there are very few words with very high probability. In this case, if we choose k to be large, like 10, we may end up choosing tokens like "going" and "n't" which makes the generated text incoherent. Therefore, it seems that having a fixed-k may lead to degeneration<br />
<br />
<br />
[[File: fixed-k.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 2: Flat versus peaked distribution of tokens</div><br />
<br />
==Language Model Decoding==<br />
There are two types of generation tasks. <br />
<br />
1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Due to this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.<br />
<br />
2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks and the focus of this paper.<br />
<br />
The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.<br />
<br />
\begin{align}<br />
P(x_{1:m+n})=\prod_{i=1}^{m+n}P(x_i|x_1 \ldots x_{i-1})<br />
\end{align}<br />
<br />
<br />
<br />
====Nucleus Sampling====<br />
This decoding strategy is indeed truncating the long tail of the probability distribution. In order to do that, first, we need to find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. Then set <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math> and rescale the probability distribution with <math>p'</math> and select the tokens from <math>P'</math>. <br />
<math><br />
P'(x|x_{1:i-1}) = \begin{cases}<br />
\frac{P(x|x_{1:i-1})}{p'}, & \mbox{if } x \in V^{(p)} \\<br />
0 & \mbox{if } otherwise<br />
\end{cases}<br />
<br />
</math><br />
<br />
====Top-k Sampling====<br />
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size <math>k</math>, <math>V^{(k)} </math>, which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling. <br />
<br />
Intuitively, the difference between Top-k sampling and Nucleus sampling is how they set a threshold of truncation - the former one defines a threshold at which the tail of the probability distribution gets truncated, whereas the latter puts a cap the number of tokens in the vocabulary set. It is noteworthy that thresholding the number of tokens can cause <math>p'</math> to fluctuate greatly at different time steps.<br />
<br />
====Sampling with Temperature====<br />
In this method, which was proposed in [1], the probability of tokens are calculated according to the equation below where <math>t \in (0,1)</math> is the temperature and <math>u_{1:|V|} </math> are logits. <br />
<br />
<math><br />
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}<br />
</math><br />
<br />
Recent studies have shown that lowering <math>t</math> improves the quality of the generated texts while it decreases diversity. Note that the temperature <math>t</math> controls how conservative the model is, and this analogy comes from thermodynamics, where lower temperature means lower energy states are unlikely to be encountered. Hence, the lower the temperature, the less likely the model is to sample tokens with lower probability.<br />
<br />
==Likelihood Evaluation==<br />
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.<br />
<br />
<br />
====Perplexity====<br />
<br />
This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts. <br />
<br />
[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 3: Comparison of perplexity across decoding strategies</div><br />
<br />
====What is Perplexity?====<br />
<br />
Perplexity as previously mentioned is a score that comes from information theory [3]. It is a measure of how well a probabilistic model or distribution predicts a sample. This intuitively leads to it being useful for comparing how competition models explain the same sample or dataset. Perplexity has close ties to information entropy as can be seen in the following discrete formulation of perplexity for a probability distribution.<br />
<br />
:<math>PP(p) := 2^{H(p)}=2^{-\sum_x p(x)\log_2 p(x)}</math><br />
<br />
Here <math>H(p)</math> is the entropy in bits and <math>p(x)</math> is the probability of observing <math>x</math> from the distribution.<br />
<br />
Perplexity in the context of probability models also has close ties to information entropy. The idea here is a model <math>f(x)</math> is fit to data from an unknown probability distribution <math>p(x)</math>. When the model is given test samples which were not used during its construction; the model will assign these samples some probability <math>f(x_i)</math>. Here <math>x_i</math> comes from a test set where <math>i = 1,...,N</math>. The perplexity will be lowest for a model which has high probabilities for the test samples. This can be seen in the following equation:<br />
<br />
:<math>PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)}</math><br />
<br />
Here <math>b</math> is the base and can be any number though commonly 2 is used to represent bits.<br />
<br />
==Distributional Statistical Evaluation==<br />
====Zipf Distribution Analysis====<br />
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.<br />
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 4: Zipf Distribution Analysis</div><br />
<br />
<br />
====Self BLEU====<br />
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling. <br />
<br />
[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 5: Comparison of Self-BLEU for decoding strategies</div><br />
<br />
==Conclusion==<br />
In this paper, different decoding strategies were analyzed on open-ended generation tasks. They showed that likelihood maximization decoding causes degeneration where decoding strategies which rely on truncating the probability distribution of tokens especially Nucleus sampling can produce coherent and diverse texts close to human-generated texts.<br />
<br />
== References ==<br />
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.<br />
<br />
[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018<br />
<br />
[3]: Perplexity: https://en.wikipedia.org/wiki/Perplexity</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration&diff=44256The Curious Case of Degeneration2020-11-15T00:03:17Z<p>J32edwar: /* Language Model Decoding */</p>
<hr />
<div>== Presented by == <br />
Donya Hamzeian<br />
== Introduction == <br />
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results. <br />
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 1: Text generation examples</div><br />
<br />
As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements. <br />
<br />
The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts<br />
<br />
Others have questioned whether a problem with beam search is that by expanding on only the top k tokens in each step of the generation, in later steps it may miss possible sequence that would have resulted in a more probable overall phrase. The authors argue that this isn't an issue for generating natural language as it has lower per-token probability on average and people usually optimize against saying the obvious.<br />
<br />
The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have high variance in likelihood leading to incoherency issues. <br />
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.<br />
<br />
<br />
===The problem with a fixed k===<br />
<br />
In the figure below, it can be seen why having a fixed k in the top-k sampling decoding strategy can lead to degenerative results, more specifically, incoherent and low diversity texts. For instance, in the left figure, the distribution of the next token is flat i.e. there are many tokens with nearly equal probability to be the next token. In this case, if we choose a small k, like 5, some tokens like "meant" and "want" may not appear in the generated text which makes it less diverse. On the other hand, in the right figure, the distribution of the next token is peaked, i.e there are very few words with very high probability. In this case, if we choose k to be large, like 10, we may end up choosing tokens like "going" and "n't" which makes the generated text incoherent. Therefore, it seems that having a fixed-k may lead to degeneration<br />
<br />
<br />
[[File: fixed-k.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 2: Flat versus peaked distribution of tokens</div><br />
<br />
==Language Model Decoding==<br />
There are two types of generation tasks. <br />
<br />
1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Due to this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.<br />
<br />
2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks and the focus of this paper.<br />
<br />
The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.<br />
<math>P(x_{1:m+n})=\prod_{i=1}^{m+n}P(x_i|x_1 \ldots x_{i-1})</math> <br />
====Nucleus Sampling====<br />
This decoding strategy is indeed truncating the long tail of the probability distribution. In order to do that, first, we need to find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. Then set <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math> and rescale the probability distribution with <math>p'</math> and select the tokens from <math>P'</math>. <br />
<math><br />
P'(x|x_{1:i-1}) = \begin{cases}<br />
\frac{P(x|x_{1:i-1})}{p'}, & \mbox{if } x \in V^{(p)} \\<br />
0 & \mbox{if } otherwise<br />
\end{cases}<br />
<br />
</math><br />
<br />
====Top-k Sampling====<br />
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size <math>k</math>, <math>V^{(k)} </math>, which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling. <br />
<br />
Intuitively, the difference between Top-k sampling and Nucleus sampling is how they set a threshold of truncation - the former one defines a threshold at which the tail of the probability distribution gets truncated, whereas the latter puts a cap the number of tokens in the vocabulary set. It is noteworthy that thresholding the number of tokens can cause <math>p'</math> to fluctuate greatly at different time steps.<br />
<br />
====Sampling with Temperature====<br />
In this method, which was proposed in [1], the probability of tokens are calculated according to the equation below where <math>t \in (0,1)</math> is the temperature and <math>u_{1:|V|} </math> are logits. <br />
<br />
<math><br />
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}<br />
</math><br />
<br />
Recent studies have shown that lowering <math>t</math> improves the quality of the generated texts while it decreases diversity. Note that the temperature <math>t</math> controls how conservative the model is, and this analogy comes from thermodynamics, where lower temperature means lower energy states are unlikely to be encountered. Hence, the lower the temperature, the less likely the model is to sample tokens with lower probability.<br />
<br />
==Likelihood Evaluation==<br />
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.<br />
<br />
<br />
====Perplexity====<br />
<br />
This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts. <br />
<br />
[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 3: Comparison of perplexity across decoding strategies</div><br />
<br />
====What is Perplexity?====<br />
<br />
Perplexity as previously mentioned is a score that comes from information theory [3]. It is a measure of how well a probabilistic model or distribution predicts a sample. This intuitively leads to it being useful for comparing how competition models explain the same sample or dataset. Perplexity has close ties to information entropy as can be seen in the following discrete formulation of perplexity for a probability distribution.<br />
<br />
:<math>PP(p) := 2^{H(p)}=2^{-\sum_x p(x)\log_2 p(x)}</math><br />
<br />
Here <math>H(p)</math> is the entropy in bits and <math>p(x)</math> is the probability of observing <math>x</math> from the distribution.<br />
<br />
Perplexity in the context of probability models also has close ties to information entropy. The idea here is a model <math>f(x)</math> is fit to data from an unknown probability distribution <math>p(x)</math>. When the model is given test samples which were not used during its construction; the model will assign these samples some probability <math>f(x_i)</math>. Here <math>x_i</math> comes from a test set where <math>i = 1,...,N</math>. The perplexity will be lowest for a model which has high probabilities for the test samples. This can be seen in the following equation:<br />
<br />
:<math>PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)}</math><br />
<br />
Here <math>b</math> is the base and can be any number though commonly 2 is used to represent bits.<br />
<br />
==Distributional Statistical Evaluation==<br />
====Zipf Distribution Analysis====<br />
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.<br />
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 4: Zipf Distribution Analysis</div><br />
<br />
<br />
====Self BLEU====<br />
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling. <br />
<br />
[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 5: Comparison of Self-BLEU for decoding strategies</div><br />
<br />
==Conclusion==<br />
In this paper, different decoding strategies were analyzed on open-ended generation tasks. They showed that likelihood maximization decoding causes degeneration where decoding strategies which rely on truncating the probability distribution of tokens especially Nucleus sampling can produce coherent and diverse texts close to human-generated texts.<br />
<br />
== References ==<br />
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.<br />
<br />
[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018<br />
<br />
[3]: Perplexity: https://en.wikipedia.org/wiki/Perplexity</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration&diff=44221The Curious Case of Degeneration2020-11-14T22:53:58Z<p>J32edwar: </p>
<hr />
<div>== Presented by == <br />
Donya Hamzeian<br />
== Introduction == <br />
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results. <br />
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 1: Text generation examples</div><br />
<br />
As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements. <br />
<br />
The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts<br />
<br />
Others have questioned whether a problem with beam search is that by expanding on only the top k tokens in each step of the generation, in later steps it may miss possible sequence that would have resulted in a more probable overall phrase. The authors argue that this isn't an issue for generating natural language as it has lower per-token probability on average and people usually optimize against saying the obvious.<br />
<br />
The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have high variance in likelihood leading to incoherency issues. <br />
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.<br />
<br />
<br />
===The problem with a fixed k===<br />
<br />
In the figure below, it can be seen why having a fixed k in the top-k sampling decoding strategy can lead to degenerative results, more specifically, incoherent and low diversity texts. For instance, in the left figure, the distribution of the next token is flat i.e. there are many tokens with nearly equal probability to be the next token. In this case, if we choose a small k, like 5, some tokens like "meant" and "want" may not appear in the generated text which makes it less diverse. On the other hand, in the right figure, the distribution of the next token is peaked, i.e there are very few words with very high probability. In this case, if we choose k to be large, like 10, we may end up choosing tokens like "going" and "n't" which makes the generated text incoherent. Therefore, it seems that having a fixed-k may lead to degeneration<br />
<br />
<br />
[[File: fixed-k.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 2: Flat versus peaked distribution of tokens</div><br />
<br />
==Language Model Decoding==<br />
There are two types of generation tasks. <br />
<br />
1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Due to this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.<br />
<br />
2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks and the focus of this paper.<br />
<br />
The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability. <br />
====Nucleus Sampling====<br />
This decoding strategy is indeed truncating the long tail of the probability distribution. In order to do that, first, we need to find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. Then set <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math> and rescale the probability distribution with <math>p'</math> and select the tokens from <math>P'</math>. <br />
<math><br />
P'(x|x_{1:i-1}) = \begin{cases}<br />
\frac{P(x|x_{1:i-1})}{p'}, & \mbox{if } x \in V^{(p)} \\<br />
0 & \mbox{if } otherwise<br />
\end{cases}<br />
<br />
</math><br />
<br />
====Top-k Sampling====<br />
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size <math>k</math>, <math>V^{(k)} </math>, which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling. <br />
<br />
Intuitively, the difference between Top-k sampling and Nucleus sampling is how they set a threshold of truncation - the former one defines a threshold at which the tail of the probability distribution gets truncated, whereas the latter puts a cap the number of tokens in the vocabulary set. It is noteworthy that thresholding the number of tokens can cause <math>p'</math> to fluctuate greatly at different time steps.<br />
<br />
====Sampling with Temperature====<br />
In this method, which was proposed in [1], the probability of tokens are calculated according to the equation below where <math>t \in (0,1)</math> is the temperature and <math>u_{1:|V|} </math> are logits. <br />
<br />
<math><br />
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}<br />
</math><br />
<br />
Recent studies have shown that lowering <math>t</math> improves the quality of the generated texts while it decreases diversity. Note that the temperature <math>t</math> controls how conservative the model is, and this analogy comes from thermodynamics, where lower temperature means lower energy states are unlikely to be encountered. Hence, the lower the temperature, the less likely the model is to sample tokens with lower probability.<br />
<br />
==Likelihood Evaluation==<br />
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.<br />
<br />
<br />
====Perplexity====<br />
<br />
This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts. <br />
<br />
[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 3: Comparison of perplexity across decoding strategies</div><br />
<br />
====What is Perplexity?====<br />
<br />
Perplexity as previously mentioned is a score that comes from information theory [3]. It is a measure of how well a probabilistic model or distribution predicts a sample. This intuitively leads to it being useful for comparing how competition models explain the same sample or dataset. Perplexity has close ties to information entropy as can be seen in the following discrete formulation of perplexity for a probability distribution.<br />
<br />
:<math>PP(p) := 2^{H(p)}=2^{-\sum_x p(x)\log_2 p(x)}</math><br />
<br />
Here <math>H(p)</math> is the entropy in bits and <math>p(x)</math> is the probability of observing <math>x</math> from the distribution.<br />
<br />
Perplexity in the context of probability models also has close ties to information entropy. The idea here is a model <math>f(x)</math> is fit to data from an unknown probability distribution <math>p(x)</math>. When the model is given test samples which were not used during its construction; the model will assign these samples some probability <math>f(x_i)</math>. Here <math>x_i</math> comes from a test set where <math>i = 1,...,N</math>. The perplexity will be lowest for a model which has high probabilities for the test samples. This can be seen in the following equation:<br />
<br />
:<math>PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)}</math><br />
<br />
Here <math>b</math> is the base and can be any number though commonly 2 is used to represent bits.<br />
<br />
==Distributional Statistical Evaluation==<br />
====Zipf Distribution Analysis====<br />
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.<br />
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 4: Zipf Distribution Analysis</div><br />
<br />
<br />
====Self BLEU====<br />
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling. <br />
<br />
[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 5: Comparison of Self-BLEU for decoding strategies</div><br />
<br />
==Conclusion==<br />
In this paper, different decoding strategies were analyzed on open-ended generation tasks. They showed that likelihood maximization decoding causes degeneration where decoding strategies which rely on truncating the probability distribution of tokens especially Nucleus sampling can produce coherent and diverse texts close to human-generated texts.<br />
<br />
== References ==<br />
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.<br />
<br />
[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018<br />
<br />
[3]: Perplexity: https://en.wikipedia.org/wiki/Perplexity</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration&diff=44220The Curious Case of Degeneration2020-11-14T22:52:57Z<p>J32edwar: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Donya Hamzeian<br />
== Introduction == <br />
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results. <br />
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 1: Text generation examples</div><br />
<br />
As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements. <br />
<br />
The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts<br />
<br />
Others have questioned whether a problem with beam search is that by expanding on only the top k tokens in each step of the generation,in later steps it may miss possible sequence that would have resulted in a more probable overall phrase. The authors argue that this isn't an issue for generating natural language as it has lower per-token probability on average and people usually optimize against saying the obvious.<br />
<br />
The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have high variance in likelihood leading to incoherency issues. <br />
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.<br />
<br />
<br />
===The problem with a fixed k===<br />
<br />
In the figure below, it can be seen why having a fixed k in the top-k sampling decoding strategy can lead to degenerative results, more specifically, incoherent and low diversity texts. For instance, in the left figure, the distribution of the next token is flat i.e. there are many tokens with nearly equal probability to be the next token. In this case, if we choose a small k, like 5, some tokens like "meant" and "want" may not appear in the generated text which makes it less diverse. On the other hand, in the right figure, the distribution of the next token is peaked, i.e there are very few words with very high probability. In this case, if we choose k to be large, like 10, we may end up choosing tokens like "going" and "n't" which makes the generated text incoherent. Therefore, it seems that having a fixed-k may lead to degeneration<br />
<br />
<br />
[[File: fixed-k.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 2: Flat versus peaked distribution of tokens</div><br />
<br />
==Language Model Decoding==<br />
There are two types of generation tasks. <br />
<br />
1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Due to this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.<br />
<br />
2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks and the focus of this paper.<br />
<br />
The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability. <br />
====Nucleus Sampling====<br />
This decoding strategy is indeed truncating the long tail of the probability distribution. In order to do that, first, we need to find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. Then set <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math> and rescale the probability distribution with <math>p'</math> and select the tokens from <math>P'</math>. <br />
<math><br />
P'(x|x_{1:i-1}) = \begin{cases}<br />
\frac{P(x|x_{1:i-1})}{p'}, & \mbox{if } x \in V^{(p)} \\<br />
0 & \mbox{if } otherwise<br />
\end{cases}<br />
<br />
</math><br />
<br />
====Top-k Sampling====<br />
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size <math>k</math>, <math>V^{(k)} </math>, which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling. <br />
<br />
Intuitively, the difference between Top-k sampling and Nucleus sampling is how they set a threshold of truncation - the former one defines a threshold at which the tail of the probability distribution gets truncated, whereas the latter puts a cap the number of tokens in the vocabulary set. It is noteworthy that thresholding the number of tokens can cause <math>p'</math> to fluctuate greatly at different time steps.<br />
<br />
====Sampling with Temperature====<br />
In this method, which was proposed in [1], the probability of tokens are calculated according to the equation below where <math>t \in (0,1)</math> is the temperature and <math>u_{1:|V|} </math> are logits. <br />
<br />
<math><br />
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}<br />
</math><br />
<br />
Recent studies have shown that lowering <math>t</math> improves the quality of the generated texts while it decreases diversity. Note that the temperature <math>t</math> controls how conservative the model is, and this analogy comes from thermodynamics, where lower temperature means lower energy states are unlikely to be encountered. Hence, the lower the temperature, the less likely the model is to sample tokens with lower probability.<br />
<br />
==Likelihood Evaluation==<br />
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.<br />
<br />
<br />
====Perplexity====<br />
<br />
This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts. <br />
<br />
[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 3: Comparison of perplexity across decoding strategies</div><br />
<br />
====What is Perplexity?====<br />
<br />
Perplexity as previously mentioned is a score that comes from information theory [3]. It is a measure of how well a probabilistic model or distribution predicts a sample. This intuitively leads to it being useful for comparing how competition models explain the same sample or dataset. Perplexity has close ties to information entropy as can be seen in the following discrete formulation of perplexity for a probability distribution.<br />
<br />
:<math>PP(p) := 2^{H(p)}=2^{-\sum_x p(x)\log_2 p(x)}</math><br />
<br />
Here <math>H(p)</math> is the entropy in bits and <math>p(x)</math> is the probability of observing <math>x</math> from the distribution.<br />
<br />
Perplexity in the context of probability models also has close ties to information entropy. The idea here is a model <math>f(x)</math> is fit to data from an unknown probability distribution <math>p(x)</math>. When the model is given test samples which were not used during its construction; the model will assign these samples some probability <math>f(x_i)</math>. Here <math>x_i</math> comes from a test set where <math>i = 1,...,N</math>. The perplexity will be lowest for a model which has high probabilities for the test samples. This can be seen in the following equation:<br />
<br />
:<math>PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)}</math><br />
<br />
Here <math>b</math> is the base and can be any number though commonly 2 is used to represent bits.<br />
<br />
==Distributional Statistical Evaluation==<br />
====Zipf Distribution Analysis====<br />
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.<br />
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 4: Zipf Distribution Analysis</div><br />
<br />
<br />
====Self BLEU====<br />
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling. <br />
<br />
[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 5: Comparison of Self-BLEU for decoding strategies</div><br />
<br />
==Conclusion==<br />
In this paper, different decoding strategies were analyzed on open-ended generation tasks. They showed that likelihood maximization decoding causes degeneration where decoding strategies which rely on truncating the probability distribution of tokens especially Nucleus sampling can produce coherent and diverse texts close to human-generated texts.<br />
<br />
== References ==<br />
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.<br />
<br />
[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018<br />
<br />
[3]: Perplexity: https://en.wikipedia.org/wiki/Perplexity</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43258From Variational to Deterministic Autoencoders2020-11-03T17:38:10Z<p>J32edwar: /* Presented by */</p>
<hr />
<div>== Presented by == <br />
John Landon Edwards<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{x} D_{\theta}(E_\phi(x)) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell})\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space for each of the models.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST, CIFAR, and CELEBA datasets. Their performance across each metric and each dataset can be seen in '''figure 1'''. For the GMM metric and for each dataset, all RAE variants with regularization schemes outperform the baseline models. Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants outperform the baseline models within the CIFAR and CELEBA datasets. This suggests RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
[[File:Image Gen Res.png|Image Gen Res.png|center]]<br />
<div align="center">'''Figure 1:''' Image Generation Results </div><br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>\log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>\log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
[[File:complex obj res.png|center]]<br />
<div align="center">'''Figure 2:''' Complex Object Generation Results </div><br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
There is empirical evidence to support the sample quality of RAES is comparable to VAE’s. The Authors are inconclusive in determining how the different variants of regularization schemes affect the RAE’s performance as there was much variation between them for datasets. They do note they opted to use the L2 version in the structured objects experiment because it was the simplest to implement.<br />
There is also empirical evidence that using the ex-post density estimation when applied to existing VAE frameworks improves their sample quality as seen in the image generation experiment, this offers a plausible way to potentially improve existing VAE architectures. My Overall impression of the paper is they provided substantial evidence that a deterministic autocoder can learn a latent space that is of comparable or better quality than that of a VAE. Although they observe favourable results for their RAE framework, its still far from conclusive whether RAE will perform better in all data domains.A future comparison I would be interested in seeing is with VQ-VAE’s in the domain sound generation.<br />
<br />
== References ==<br />
<br />
<br />
[1] Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=43167stat940F212020-11-02T19:53:28Z<p>J32edwar: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || ||<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || https://openreview.net/pdf?id=H1eA7AEtvS || ||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || ||<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || ||<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || ||<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| EMPIRICAL STUDIES ON THE PROPERTIES OF LINEAR REGIONS IN DEEP NEURAL NETWORKS || [https://openreview.net/pdf?id=SkeFl1HKwr Paper] || ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || |-<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=43163stat940F212020-11-02T18:50:41Z<p>J32edwar: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || ||<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || https://openreview.net/pdf?id=H1eA7AEtvS || ||<br />
|-<br />
|Week of Nov 2 ||John Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || ||<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || ||<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || ||<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| EMPIRICAL STUDIES ON THE PROPERTIES OF LINEAR REGIONS IN DEEP NEURAL NETWORKS || [https://openreview.net/pdf?id=SkeFl1HKwr Paper] || ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || |-<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=43162stat940F212020-11-02T18:44:11Z<p>J32edwar: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || ||<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || https://openreview.net/pdf?id=H1eA7AEtvS || ||<br />
|-<br />
|Week of Nov 2 ||John Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/nWNp_M77D10 Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || ||<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || ||<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || ||<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| EMPIRICAL STUDIES ON THE PROPERTIES OF LINEAR REGIONS IN DEEP NEURAL NETWORKS || [https://openreview.net/pdf?id=SkeFl1HKwr Paper] || ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || |-<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=43161stat940F212020-11-02T18:30:17Z<p>J32edwar: Undo revision 43160 by J32edwar (talk)</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] ||<br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || ||<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || https://openreview.net/pdf?id=H1eA7AEtvS || ||<br />
|-<br />
|Week of Nov 2 ||John Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/nWNp_M77D10 Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] ||<br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || ||<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || ||<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || ||<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| EMPIRICAL STUDIES ON THE PROPERTIES OF LINEAR REGIONS IN DEEP NEURAL NETWORKS || [https://openreview.net/pdf?id=SkeFl1HKwr Paper] || ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || |-<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=43160stat940F212020-11-02T18:29:26Z<p>J32edwar: Undo revision 43159 by J32edwar (talk)</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] ||<br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || ||<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || https://openreview.net/pdf?id=H1eA7AEtvS || ||<br />
|-<br />
|Week of Nov 2 ||John Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/nWNp_M77D10 Presentation] ||<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] ||<br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || ||<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || ||<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || ||<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| EMPIRICAL STUDIES ON THE PROPERTIES OF LINEAR REGIONS IN DEEP NEURAL NETWORKS || [https://openreview.net/pdf?id=SkeFl1HKwr Paper] || ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || |-<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=43159stat940F212020-11-02T18:28:08Z<p>J32edwar: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] ||<br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || ||<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || https://openreview.net/pdf?id=H1eA7AEtvS || ||<br />
|-<br />
|Week of Nov 2 ||John Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/nWNp_M77D10 Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] ||<br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || ||<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || ||<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || ||<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| EMPIRICAL STUDIES ON THE PROPERTIES OF LINEAR REGIONS IN DEEP NEURAL NETWORKS || [https://openreview.net/pdf?id=SkeFl1HKwr Paper] || ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || |-<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=43158stat940F212020-11-02T18:18:10Z<p>J32edwar: </p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] ||<br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || ||<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || https://openreview.net/pdf?id=H1eA7AEtvS || ||<br />
|-<br />
|Week of Nov 2 ||John Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/nWNp_M77D10 Presentation] ||<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] ||<br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || ||<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || ||<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || ||<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| EMPIRICAL STUDIES ON THE PROPERTIES OF LINEAR REGIONS IN DEEP NEURAL NETWORKS || [https://openreview.net/pdf?id=SkeFl1HKwr Paper] || ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || |-<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43094From Variational to Deterministic Autoencoders2020-11-02T04:58:27Z<p>J32edwar: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{x} D_{\theta}(E_\phi(x)) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell})\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space for each of the models.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST,CIFAR ,and CELEBA datasets. Their performance across each metric and each dateset can be seen in '''figure 1'''.For the GMM metric and for each dataset all RAE variants with regualrization schemes outperform the basline models.Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants out preform the baseline models within the CIFAR and CELEBA datasets. This suggest RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
[[File:Image Gen Res.png|Image Gen Res.png|center]]<br />
<div align="center">'''Figure 1:''' Image Generation Results </div><br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>\log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>\log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
[[File:complex obj res.png|center]]<br />
<div align="center">'''Figure 2:''' Complex Object Generation Results </div><br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
There is empirical evidence to support the sample quality of RAES is comparable to VAE’s. The Authors are inconclusive in determining how the different variants of regularization schemes affect the RAE’s performance as there was much variation between them for datasets. They do note they opted to use the L2 version in the structured objects experiment because it was the simplest to implement.<br />
There is also empirical evidence that using the ex-post density estimation when applied to existing VAE frameworks improves their sample quality as seen in the image generation experiment, this offers a plausible way to potentially improve existing VAE architectures. My Overall impression of the paper is they provided substantial evidence that a deterministic autocoder can learn a latent space that is of comparable or better quality than that of a VAE. Although they observe favourable results for their RAE framework, its still far from conclusive whether RAE will perform better in all data domains.A future comparison I would be interested in seeing is with VQ-VAE’s in the domain sound generation.<br />
<br />
== References ==<br />
<br />
<br />
[1] Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43093From Variational to Deterministic Autoencoders2020-11-02T04:56:41Z<p>J32edwar: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{x} D_{\theta}(E_\phi(x)) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell})\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space for each of the models.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST,CIFAR ,and CELEBA datasets. Their performance across each metric and each dateset can be seen in '''figure 1'''.For the GMM metric and for each dataset all RAE variants with regualrization schemes outperform the basline models.Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants out preform the baseline models within the CIFAR and CELEBA datasets. This suggest RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
[[File:Image Gen Res.png|Image Gen Res.png|center]]<br />
<div align="center">'''Figure 1:''' Image Generation Results </div><br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>\log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>\log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
[[File:complex obj res.png|center]]<br />
<div align="center">'''Figure 2:''' Complex Object Generation Results </div><br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
There is empirical evidence to support the sample quality of RAES is comparable to VAE’s. The Authors are inconclusive in determining how the different variants of regularization schemes affect the RAE’s performance as there was much variation between them for datasets. They do note they opted to use the L2 version in the structured objects experiment because it was the simplest to implement.<br />
There is also empirical evidence that using the ex-post density estimation when applied to existing VAE framworks improves sample their sample quality as seen in the image generation experiment, this offers a plausible way for to improve existing VAE architectures. My Overall impression of the paper is they provided substantial evidence that a deterministic autocoder can learn a latent space that is of comparable or better quality than that of a VAE. Although they observe favourable results for their RAE framework, its still far from conclusive whether RAE will perform better in all data domains.A future comparison I would be interested in seeing is with VQ-VAE’s in the domain sound generation.<br />
<br />
== References ==<br />
<br />
<br />
[1] Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=43092stat940F212020-11-02T04:52:53Z<p>J32edwar: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] ||<br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || ||<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || https://openreview.net/pdf?id=H1eA7AEtvS || ||<br />
|-<br />
|Week of Nov 2 ||John Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] ||<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] ||<br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || ||<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || ||<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Probabilistic Model-Agnostic Meta-Learning || [http://papers.nips.cc/paper/8161-probabilistic-model-agnostic-meta-learning.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| EMPIRICAL STUDIES ON THE PROPERTIES OF LINEAR REGIONS IN DEEP NEURAL NETWORKS || [https://openreview.net/pdf?id=SkeFl1HKwr Paper] || ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || |-<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43091From Variational to Deterministic Autoencoders2020-11-02T04:10:14Z<p>J32edwar: /* Redesigned Training Loss Function */</p>
<hr />
<div>== Presented by == <br />
Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{x} D_{\theta}(E_\phi(x)) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell})\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space for each of the models.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST,CIFAR ,and CELEBA datasets. Their performance across each metric and each dateset can be seen in '''figure 1'''.For the GMM metric and for each dataset all RAE variants with regualrization schemes outperform the basline models.Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants out preform the baseline models within the CIFAR and CELEBA datasets. This suggest RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
[[File:Image Gen Res.png|Image Gen Res.png|center]]<br />
<div align="center">'''Figure 1:''' Image Generation Results </div><br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>\log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>\log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
[[File:complex obj res.png|center]]<br />
<div align="center">'''Figure 2:''' Complex Object Generation Results </div><br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
<br />
<br />
== References ==<br />
<br />
<br />
[1] Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43090From Variational to Deterministic Autoencoders2020-11-02T02:28:13Z<p>J32edwar: /* Metrics of Evaluation: */</p>
<hr />
<div>== Presented by == <br />
Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{z} D_{\theta}(z) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell})\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space for each of the models.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST,CIFAR ,and CELEBA datasets. Their performance across each metric and each dateset can be seen in '''figure 1'''.For the GMM metric and for each dataset all RAE variants with regualrization schemes outperform the basline models.Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants out preform the baseline models within the CIFAR and CELEBA datasets. This suggest RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
[[File:Image Gen Res.png|Image Gen Res.png|center]]<br />
<div align="center">'''Figure 1:''' Image Generation Results </div><br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>\log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>\log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
[[File:complex obj res.png|center]]<br />
<div align="center">'''Figure 2:''' Complex Object Generation Results </div><br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
<br />
<br />
== References ==<br />
<br />
<br />
[1] Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43088From Variational to Deterministic Autoencoders2020-11-02T00:47:59Z<p>J32edwar: /* Metrics of Evaluation */</p>
<hr />
<div>== Presented by == <br />
Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{z} D_{\theta}(z) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell})\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST,CIFAR ,and CELEBA datasets. Their performance across each metric and each dateset can be seen in '''figure 1'''.For the GMM metric and for each dataset all RAE variants with regualrization schemes outperform the basline models.Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants out preform the baseline models within the CIFAR and CELEBA datasets. This suggest RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
[[File:Image Gen Res.png|Image Gen Res.png|center]]<br />
<div align="center">'''Figure 1:''' Image Generation Results </div><br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>\log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>\log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
[[File:complex obj res.png|center]]<br />
<div align="center">'''Figure 2:''' Complex Object Generation Results </div><br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
<br />
<br />
== References ==<br />
<br />
<br />
[1] Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43087From Variational to Deterministic Autoencoders2020-11-02T00:47:33Z<p>J32edwar: /* Metrics of Evaluation */</p>
<hr />
<div>== Presented by == <br />
Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{z} D_{\theta}(z) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell})\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST,CIFAR ,and CELEBA datasets. Their performance across each metric and each dateset can be seen in '''figure 1'''.For the GMM metric and for each dataset all RAE variants with regualrization schemes outperform the basline models.Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants out preform the baseline models within the CIFAR and CELEBA datasets. This suggest RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
[[File:Image Gen Res.png|Image Gen Res.png|center]]<br />
<div align="center">'''Figure 1:''' Image Generation Results </div><br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>\log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
[[File:complex obj res.png|center]]<br />
<div align="center">'''Figure 2:''' Complex Object Generation Results </div><br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
<br />
<br />
== References ==<br />
<br />
<br />
[1] Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43086From Variational to Deterministic Autoencoders2020-11-02T00:33:50Z<p>J32edwar: /* References */</p>
<hr />
<div>== Presented by == <br />
Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{z} D_{\theta}(z) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell})\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST,CIFAR ,and CELEBA datasets. Their performance across each metric and each dateset can be seen in '''figure 1'''.For the GMM metric and for each dataset all RAE variants with regualrization schemes outperform the basline models.Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants out preform the baseline models within the CIFAR and CELEBA datasets. This suggest RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
[[File:Image Gen Res.png|Image Gen Res.png|center]]<br />
<div align="center">'''Figure 1:''' Image Generation Results </div><br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
[[File:complex obj res.png|center]]<br />
<div align="center">'''Figure 2:''' Complex Object Generation Results </div><br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
<br />
<br />
== References ==<br />
<br />
<br />
[1] Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43085From Variational to Deterministic Autoencoders2020-11-01T22:17:41Z<p>J32edwar: /* Redesigned Training Loss Function */</p>
<hr />
<div>== Presented by == <br />
Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{z} D_{\theta}(z) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell})\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST,CIFAR ,and CELEBA datasets. Their performance across each metric and each dateset can be seen in '''figure 1'''.For the GMM metric and for each dataset all RAE variants with regualrization schemes outperform the basline models.Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants out preform the baseline models within the CIFAR and CELEBA datasets. This suggest RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
[[File:Image Gen Res.png|Image Gen Res.png|center]]<br />
<div align="center">'''Figure 1:''' Image Generation Results </div><br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
[[File:complex obj res.png|center]]<br />
<div align="center">'''Figure 2:''' Complex Object Generation Results </div><br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
<br />
<br />
== References ==<br />
<br />
<br />
[1]- Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] -Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] -Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43069From Variational to Deterministic Autoencoders2020-11-01T10:58:49Z<p>J32edwar: /* Results */</p>
<hr />
<div>== Presented by == <br />
Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{z} D_{\theta}(z) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell}\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST,CIFAR ,and CELEBA datasets. Their performance across each metric and each dateset can be seen in '''figure 1'''.For the GMM metric and for each dataset all RAE variants with regualrization schemes outperform the basline models.Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants out preform the baseline models within the CIFAR and CELEBA datasets. This suggest RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
[[File:Image Gen Res.png|Image Gen Res.png|center]]<br />
<div align="center">'''Figure 1:''' Image Generation Results </div><br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
[[File:complex obj res.png|center]]<br />
<div align="center">'''Figure 2:''' Complex Object Generation Results </div><br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
<br />
<br />
== References ==<br />
<br />
<br />
[1]- Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] -Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] -Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43068From Variational to Deterministic Autoencoders2020-11-01T10:58:23Z<p>J32edwar: /* Results */</p>
<hr />
<div>== Presented by == <br />
Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{z} D_{\theta}(z) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell}\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST,CIFAR ,and CELEBA datasets. Their performance across each metric and each dateset can be seen in '''figure 1'''.For the GMM metric and for each dataset all RAE variants with regualrization schemes outperform the basline models.Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants out preform the baseline models within the CIFAR and CELEBA datasets. This suggest RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
[[File:Image Gen Res.png|Image Gen Res.png|center]]<br />
<div align="center">'''Figure 1:''' Image Generation Results </div><br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
[[File:complex obj res.png|center]]<br />
<div align="center">'''Figure 1:''' Image Generation Results </div><br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
<br />
<br />
== References ==<br />
<br />
<br />
[1]- Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] -Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] -Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43067From Variational to Deterministic Autoencoders2020-11-01T10:57:31Z<p>J32edwar: /* Results: */</p>
<hr />
<div>== Presented by == <br />
Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{z} D_{\theta}(z) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell}\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST,CIFAR ,and CELEBA datasets. Their performance across each metric and each dateset can be seen in '''figure 1'''.For the GMM metric and for each dataset all RAE variants with regualrization schemes outperform the basline models.Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants out preform the baseline models within the CIFAR and CELEBA datasets. This suggest RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
[[File:Image Gen Res.png|Image Gen Res.png|center]]<br />
<div align="center">'''Figure 1:''' Image Generation Results </div><br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
<br />
<br />
== References ==<br />
<br />
<br />
[1]- Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] -Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] -Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43066From Variational to Deterministic Autoencoders2020-11-01T10:56:35Z<p>J32edwar: /* Results: */</p>
<hr />
<div>== Presented by == <br />
Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{z} D_{\theta}(z) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell}\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST,CIFAR ,and CELEBA datasets. Their performance across each metric and each dateset can be seen in '''figure 1'''.For the GMM metric and for each dataset all RAE variants with regualrization schemes outperform the basline models.Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants out preform the baseline models within the CIFAR and CELEBA datasets. This suggest RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
[[File:Image Gen Res.png|Image Gen Res.png|center]]<br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
<br />
<br />
== References ==<br />
<br />
<br />
[1]- Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] -Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] -Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43065From Variational to Deterministic Autoencoders2020-11-01T10:56:23Z<p>J32edwar: /* Results: */</p>
<hr />
<div>== Presented by == <br />
Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{z} D_{\theta}(z) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell}\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST,CIFAR ,and CELEBA datasets. Their performance across each metric and each dateset can be seen in '''figure 1'''.For the GMM metric and for each dataset all RAE variants with regualrization schemes outperform the basline models.Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants out preform the baseline models within the CIFAR and CELEBA datasets. This suggest RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
[[File:Image Gen Res.png|Image Gen Res.png]]<br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
<br />
<br />
== References ==<br />
<br />
<br />
[1]- Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] -Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] -Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43064From Variational to Deterministic Autoencoders2020-11-01T10:55:48Z<p>J32edwar: /* Results: */</p>
<hr />
<div>== Presented by == <br />
Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{z} D_{\theta}(z) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell}\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST,CIFAR ,and CELEBA datasets. Their performance across each metric and each dateset can be seen in '''figure 1'''.For the GMM metric and for each dataset all RAE variants with regualrization schemes outperform the basline models.Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants out preform the baseline models within the CIFAR and CELEBA datasets. This suggest RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
[[File:Img Gen Res.png|center]]<br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
<br />
<br />
== References ==<br />
<br />
<br />
[1]- Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] -Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] -Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:complex_obj_res.png&diff=43063File:complex obj res.png2020-11-01T10:54:49Z<p>J32edwar: </p>
<hr />
<div></div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Image_Gen_Res.png&diff=43062File:Image Gen Res.png2020-11-01T10:54:31Z<p>J32edwar: </p>
<hr />
<div></div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43061From Variational to Deterministic Autoencoders2020-11-01T10:49:56Z<p>J32edwar: /* Metrics of Evaluation */</p>
<hr />
<div>== Presented by == <br />
Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{z} D_{\theta}(z) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell}\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST,CIFAR ,and CELEBA datasets. Their performance across each metric and each dateset can be seen in '''figure 1'''.For the GMM metric and for each dataset all RAE variants with regualrization schemes outperform the basline models.Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants out preform the baseline models within the CIFAR and CELEBA datasets. This suggest RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
<br />
<br />
== References ==<br />
<br />
<br />
[1]- Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] -Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] -Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwarhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=43060From Variational to Deterministic Autoencoders2020-11-01T10:49:10Z<p>J32edwar: /* Results: */</p>
<hr />
<div>== Presented by == <br />
Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to Varational Autoencoders (VAEs) titled Regularized Autoencoders (RAEs) for generative modelling which is deterministic.<br />
They investigate how the forcing of an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilising an ex-post density estimation step that can also be applied to existing VAEs. Finally, They conduct an empirical comparison between VAEs and RAEs to demonstrate the latter are able to generate samples that are comparable or better when applied to domains of images and structured object.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors consider VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> as having a regularization effect whereby it promotes the learning if a smoother latent space. This motivates their exploration of regularization schemes within an auto-encoders loss that could be substituted in place of the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
The removal of random noise injection from VAE's eliminates the ability to sample fro <math>p(z)</math> and in turn produce generated samples. This motivates the authours to fitting a density estimate of the latent post-training so that the sampling mechanism can be reclaimed.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their frame work and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However the RAEs utilize a different loss function and differs in its implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantised-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilise a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally it proposes a ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposal to eliminate the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent varible <math> z </math>.<br />
<br />
The current varational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere.Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results.This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\text{where }\lambda\text{ and }\beta\text{ are hyper parameters}<br />
\end{align}<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{z} D_{\theta}(z) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell}\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
In this process a density estimator <math>q_{\delta}(\mathbf{z})</math> is fit over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produce but still allowing for generalization to untrained points.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normaliztion. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussianv(CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST,CIFAR ,and CELEBA datasets. Their performance across each metric and each dateset can be seen in '''figure 1'''.For the GMM metric and for each dataset all RAE variants with regualrization schemes outperform the basline models.Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants out preform the baseline models within the CIFAR and CELEBA datasets. This suggest RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
=== Modeling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions .They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions they consider <math>log(1 + MSE)</math> between generated expressions and the true data.To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule.They compare the GRAEs performance on these metrics to those of the GVAE,the constant variance GVAE (GCVVAE) , and the CharacterVAE (CVAE) [2] as seen in '''figure 2'''. Additionally, to asses the behaviour within the latent space they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. Its notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that a deterministic autoencoders is capable of learning a smooth latent space without the requirement of a prior distribution. This allows for circumvention of drawbacks associated with the varational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
<br />
<br />
== References ==<br />
<br />
<br />
[1]- Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] -Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] -Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.</div>J32edwar