http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Msikarou&feedformat=atomstatwiki - User contributions [US]2024-03-29T15:58:42ZUser contributionsMediaWiki 1.41.0http://wiki.math.uwaterloo.ca/statwiki/index.php?title=F21-STAT_940-Proposal&diff=49784F21-STAT 940-Proposal2020-12-07T18:24:56Z<p>Msikarou: </p>
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
<br />
<br />
<br />
Project # 1 Group members:<br />
<br />
McWhannel, Pierre<br />
<br />
Yan, Nicole<br />
<br />
Hussein Salamah, Ahmed <br />
<br />
Title: Dense Retrieval for Conversational Information Seeking <br />
<br />
Description:<br />
One of the recognized problems in Information Retrieval (IR) is the conversational search that attracts much attention in form of Conversational Assistants such as Alexa, Siri and Cortana. The users’ needs are the ultimate goal of conversational search systems, in this context the questions are asked sequentially imposing a multi-turn format as the Conversational Information Seeking (CIS) task. TREC Conversational Assistance Track (CAsT) [3] is a multi-turn conversational search task as it contains a large-scale reusable test collection for sequences of conversational queries. The response of this conversational model is not a list of relevant documents, but it is limited to brief response passages with a length of 1 to 3 sentences in length.<br />
<br />
[[File:Screen Shot 2020-10-09 at 1.33.00 PM.png | 300px | Example Queries in CAsT]]<br />
<br />
In [4], the authors focus on improving open domain question answering by including dense representations for retrieval instead of the traditional methods. They have adopted a simple dual-encoder framework to construct a learnable retriever on large collections. We want to adopt this dense representation for the conversational model in the CAsT task and compare it with the performance of the other approaches in literature. The performance will be indicated by using graded relevance on five point, which are Fails to meet, Slightly meets, Moderately meets, Highly meets, and Fully meets.<br />
<br />
We aim to further improve our system performance by integrating the following techniques:<br />
<br />
• Paragraph-level pre-training tasks: ICT, BFS, and WLP [1]<br />
<br />
• ANCE training: periodically using checkpoints to encode documents, from which the strong negatives close to the relevant document would be used as next training negatives [5]<br />
<br />
In summary, this project is exploratory in nature as we will be trying to use state-of-art Dense Passage Retrieval techniques (based on BERT) [4, 6], in a question answering (QA) problem. Current first-stage-retrieval approaches mainly rely on bag-of-words models. In this project, we hope to explore the feasibility of using state-of-art methods such as BERT. We will first compare how these perform on the TREC CAsT datasets [3] against the results retrieved using BM25. After these first points of comparison we will next explore methods of improving DPR by exploring one or more techniques that are made to improve the performance of DPR. [1, 5].<br />
<br />
References<br />
<br />
[1] Wei-Cheng Chang et al. Pre-training Tasks for Embedding-based Large-scale Retrieval. 2020. arXiv: 2002.03932 [cs.LG].<br />
<br />
[2] Zhuyun Dai and Jamie Callan. Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval. 2019. arXiv: 1910.10687 [cs.IR].<br />
<br />
[3] Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. TREC CAsT 2019: The Conversational Assistance Track Overview. 2020. arXiv: 2003.13624 [cs.IR].<br />
<br />
[4] Vladimir Karpukhin et al. Dense Passage Retrieval for Open-Domain Ques- tion Answering. 2020. arXiv: 2004.04906 [cs.CL].<br />
<br />
[5] Lee Xiong et al. Approximate Nearest Neighbor Negative Contrastive Learn- ing for Dense Text Retrieval. 2020. arXiv: 2007.00808 [cs.IR].<br />
<br />
[6] Jingtao Zhan et al. RepBERT: Contextualized Text Embeddings for First- Stage Retrieval. 2020. arXiv: 2006.15498 [cs.IR].<br />
<br />
<br />
<br />
Project # 2 Group members:<br />
<br />
Singh, Gursimran<br />
<br />
Sharma, Govind<br />
<br />
Chanana, Abhinav<br />
<br />
Title: Quick Text Description using Headline Generation and Text To Image Conversion<br />
<br />
Description: An automatic tool to generate short description based on long textual data is a useful mechanism to share quick information. Most of the current approaches involve summarizing the text using varied deep learning approaches from Transformers to different RNNs. For this project, instead of building a standard text summarizer, we aim to provide two separate utilities for generating a quick description of the text. First, we plan to develop a model that produces a headline for the long textual data, and second, we are intending to generate an image describing the text. <br />
<br />
Headline Generation - Headline generation is a specific case of text summarization where the output is generally a combination of few words that gives an overall outcome from the text. In most cases, text summarization is an unsupervised learning problem. But, for the headline generation, we have the original headlines available in our training dataset that makes it a supervised learning task. We plan to experiment with different Recurrent Neural Networks like LSTMs and GRUs with varied architectures. For model evaluation, we are considering BERTScore using which we can compare the reference headline with the automatically generated headline from the model. We also aim to explore Attention and Transformer Networks for the text (headline) generation. We will make use of the currently available techniques mentioned in the various research papers but also try to develop our own architecture if the previous methods don't reveal reliable results on our dataset. Therefore, this task would primarily fit under the category of application of deep learning to a particular domain, but could also include some components of new algorithm design.<br />
<br />
Text to Image Conversion - Generation or synthesis of images from a short text description is another very interesting application domain in deep learning. One approach for image generation is based on mapping image pixels to specific features as described by the discriminative feature representation of the text. Recurrent Neural Networks have been successfully used in learning such feature representations of text. This approach is difficult to generalize because the recognition of discriminative features for texts in different domains is not an easy task and it requires domain expertise. Different generative methods have been used including Variational Recurrent Auto-Encoders and its extension in Deep Recurrent Attention Writer (DRAW). We plan to experiment with Generative Adversarial Networks (GAN). Application of GANs on domain-specific datasets has been done but we aim to apply different variants of GANs on the Microsoft COCO dataset which has been used in other architectures. The analysis will be focusing on how well GANs are able to generalize when compared to other alternatives on the given dataset.<br />
<br />
Scope - The above models will be trained independently on different datasets. Therefore, for a particular text, only one of the two functionalities will be available.<br />
<br />
<br />
<br />
Project # 3 Group members:<br />
<br />
Sikri, Gaurav<br />
<br />
Bhatia, Jaskirat<br />
<br />
Title: Malware Prediction<br />
<br />
Description: The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can hurt consumers and enterprises in many ways. With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security.<br />
<br />
In this project, we plan to predict how likely a machine is to be infected by malware given its current specifications(total 82) like: company name, Firewall status, physical RAM, etc.<br />
<br />
<br />
<br />
Project # 4 Group members:<br />
<br />
Maleki, Danial<br />
<br />
Rasoolijaberi, Maral<br />
<br />
Title: Binary Deep Neural Network for the domain of Pathology<br />
<br />
Description: The binary neural network, largely saving the storage and computation, serves as a promising technique for deploying deep models on resource-limited devices. However, the binarization inevitably causes severe information loss, and even worse, its discontinuity brings difficulty to the optimization of the deep network. We want to investigate the possibility of using these types of networks in the domain of histopathology as it has gigapixels images which make the use of them very useful.<br />
<br />
<br />
Project # 5 Group members:<br />
<br />
Jain, Abhinav<br />
<br />
Bathla, Gautam<br />
<br />
Title: lyft-motion-prediction-autonomous-vehicles(Kaggle)(Tentative)<br />
<br />
Description: Autonomous vehicles (AVs) are expected to dramatically redefine the future of transportation. However, there are still significant engineering challenges to be solved before one can fully realize the benefits of self-driving cars. One such challenge is building models that reliably predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians.<br />
<br />
Comments: We are more inclined towards a 3-D object detection project. We are in the process of finding the right problem statement for it and if we are not successful, we will continue with the above Kaggle competition.<br />
<br />
<br />
Project # 6 Group members:<br />
<br />
You, Bowen<br />
<br />
Avilez, Jose<br />
<br />
Mahmoud, Mohammad<br />
<br />
Wu, Mohan<br />
<br />
Title: Deep Learning Models in Volatility Forecasting<br />
<br />
Description: Price forecasting has become a very hot topic in the financial industry in recent years. We are however very interested in the volatility of such financial instruments. We propose a new deep learning architecture or model to predict volatility and apply our model to real life datasets of various financial products. We will analyze our results and compare them to more traditional methods.<br />
<br />
<br />
Project # 7 Group members:<br />
<br />
Chen, Meixi<br />
<br />
Shen, Wenyu<br />
<br />
Title: Through the Lens of Probability Theory: A Comparison Study of Bayesian Deep Learning Methods<br />
<br />
Description: Deep neural networks have been known as black box models, but they can be made less mysterious when adopting a Bayesian approach. From a Bayesian perspective, one is able to assign uncertainty on the weights instead of having single point estimates, which allows for a better interpretability of deep learning models. However, Bayesian deep learning methods are often intractable due an increase amount of parameters and often times don't have as good performance. In this project, we will study different BDL methods such as Bayesian CNN using variational inference and Laplace approximation, with applications on image classification, and we will try to propose improvements where possible.<br />
<br />
<br />
Project # 8 Group members:<br />
<br />
Avilez, Jose<br />
<br />
Title: A functional universal approximation theorem<br />
<br />
Description: In the seminal paper "Approximation by superpositions of a sigmoidal function", Cybenko gave a simple proof using elementary functional analysis that a certain class of functions, called discriminatory functions, serve as valid activation functions for universal neural approximators. The objective of our project is three-fold:<br />
<br />
1) Prove a converse of Cybenko's Universal Approximation Theorem by means of the Stone-Weierstrass theorem<br />
<br />
2) Provide examples and non-examples of Cybenko's discriminatory functions<br />
<br />
3) Construct a neural network for functional data (i.e. data arising in function spaces) and prove a universal approximation theorem for Lp spaces.<br />
<br />
References:<br />
<br />
[1] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4), 303-314.<br />
<br />
[2] Folland, Gerald B. Real analysis: modern techniques and their applications. Vol. 40. John Wiley & Sons, 1999.<br />
<br />
[3] Ramsay, J. O. (2004). Functional data analysis. Encyclopedia of Statistical Sciences, 4.<br />
<br />
<br />
<br />
Project # 9 Group members:<br />
<br />
Sikaroudi, Milad<br />
<br />
Ashrafi Fashi, Parsa<br />
<br />
Title: '''Magnification Generalization with Model-Agnostic Semantic Features in Histopathology Images'''<br />
<br />
Many of the embedding methods learn the subspace for only a specific magnification. However, one of the main challenges in histopathology image embedding is the different magnification levels for indexing of a Whole Slide Indexing (WSI) image [1]. It is well-known that significantly different patterns may exist at different magnification levels of a WSI [2]. <br />
It is useful to train an embedding space for discriminating the histopathology patches regardless of their magnifications. That would lead to learning more compact WSI representations. It has been an arduous task because of the significant domain shifts between different magnification levels with noticeably different patterns. The performance of conventional deep neural networks tends to degrade in the presence of a domain shift, such as the gathering of data from different centers. In this study for the first time, we are going to introduce different magnification levels as a domain shift to see if we can generalize to in-common features in different magnification levels by means of a domain generalization technique, known as Model Agnostic Learning of Semantic Features. The hypothesis is that the statistics of retrieval for the model trained using episodic domain generalization will not degrade as much as the baseline when there is a domain shift. <br />
<br />
[1] Sellaro, Tiffany L., et al. "Relationship between magnification and resolution in digital pathology systems." Journal of pathology informatics 4 (2013).<br />
<br />
[2] Zaveri, Manit, et al. "Recognizing Magnification Levels in Microscopic Snapshots." arXiv preprint arXiv:2005.03748 (2020).<br />
<br />
<br />
Project # 10 Group members:<br />
<br />
Torabian, Parsa<br />
<br />
Ebrahimi Farsangi, Sina<br />
<br />
Moayyedi, Arash<br />
<br />
Title: Meta-Learning Regularizers for Few-Shot Classification Models<br />
<br />
Our project aims at exploring the effects of self-supervised pre-training on few-shot classification. We draw inspiration from the paper “When Does Self-supervision Improve Few-shot Learning?”[1] where the authors analyse the effects of using the Jigsaw puzzle[2] and rotation tasks as regularizers for training Prototypical Networks[3] and Model-Agnostic Meta-Learning (MAML)[4] networks. <br />
<br />
The introduced paper analyzes the effects of regularizing meta-learning models using self-supervised loss, based on rotation and Jigsaw tasks. It is conventionally thought that one of the reasons MAML and other optimization based meta-learning algorithms work well is due to initializing a network into a task-generalizable state[5]. In this project, we will be looking at the effects of self-supervised pre-training, as presumably it will initialize the network into a better state than random, and potentially improve subsequent meta-learning. We will compare the effects of using self-supervised methods as pre-training, as regularization, and the combination of both. The effects of other self-supervised learning tasks, such as discoloration and flipping, will be studied as well. We will also look at which combination of tasks, whether interlaced or applied sequentially, work better and complement one another. We will evaluate our final results on the Omniglot and Mini-Imagenet datasets. These improvements will later be compared with their application on other few-shot learning methods, including first-order MAML and Matching Networks.<br />
<br />
References:<br />
<br />
[1] https://arxiv.org/abs/1910.03560<br />
<br />
[2] https://arxiv.org/abs/1603.09246<br />
<br />
[3] https://arxiv.org/abs/1703.05175 <br />
<br />
[4] https://arxiv.org/abs/1703.03400<br />
<br />
[5] https://arxiv.org/abs/2003.11539<br />
<br />
<br />
Project # 11 Group Members:<br />
<br />
Shikhar Sakhuja: s2sakhuj@uwaterloo.ca <br />
<br />
Introduction:<br />
<br />
Controller Area Network (CAN bus) is a vehicle bus standard that allows Electronic Control Units (ECU) within an automobile to communicate with each other without the need for a host computer. Modern automobiles might have up to 70 ECUs for various subsystems such as Engine, Transmission, Breaking, etc. The ECUs exchange messages on the CAN bus and allow for a lot of modern vehicle capabilities such as automatic start/stop, electric park brakes, lane detection, collision avoidance, and more. Each message exchanged on the bus is encoded as a 29-bit packet. These 29 bits consist of a combination of Parameter Group Number (PGN), message priority, and the source address of the message. Parameter groups can be, for example, engine temperature which could include coolant temperature, fuel temperature, etc. The PGN itself includes information such as priority, reserved status, data page, and PDU format. Lastly, the source address maps the message to the ECU it originates from. <br />
<br />
Goals:<br />
<br />
(1) This project aims to use messages exchanged on the CAN bus of a Challenger Truck collected by the Embedded Systems Group at the University of Waterloo. The data exists in a temporal format with a new message exchanged periodically. The goals of this project are two folds:<br />
<br />
(2) Predicting the PGN and source address of message N exchanged on the bus, given messages 1 to N-1. We might also explore predicting attributes within the PGN. <br />
Predicting the delay between messages N-1 and N, given the delay between each pair of consecutive messages leading up to message N-1. <br />
<br />
Potential Approach:<br />
<br />
For the first goal, we intend to experiment with RNN models along with Attention modules since they have shown promising results in text generation/prediction. <br />
<br />
The second goal is more of an investigative problem where we intend to use regression techniques powered by Neural Networks to predict delays between messages N-1 and N.<br />
<br />
<br />
<br />
<br />
<br />
Project # 12 Group members:<br />
<br />
Hemati, Sobhan <br />
<br />
Meaney, Cameron <br />
<br />
Title: Representation learning of gigapixel histopathology images using PointNet a permutation invariant neural network<br />
<br />
Description:<br />
<br />
In recent years, there has been a significant growth in the amount of information available in digital pathology archives. This data is valuable because of its potential uses in research, education, and pathologic diagnosis. As a result, representation learning of histopathology whole slide images (WSIs) has attracted significant attention and become an active area of research. Unfortunately, scientific progress with these data have been difficult because of challenges inherent to the data itself. These challenges include highly complex textures of different tissue types, color variations caused by different stainings, and most notably, the size of the images which are often larger than 50,000x50,000 pixels. Additionally, these images are multi-resolution meaning that each WSI may contain images from different zoom levels, primarily 5X, 10X, 20X, and 40X. With the advent of deep learning, there is optimism that these challenges can be overcome. The main challenge in this approach is that the sheer size of the images makes it infeasible (or impossible) to obtain a vector representation for a WSI, which is a necessary step in order to leverage deep learning algorithms. In practice, this is often bypassed by considering ‘patches’ of the WSI of smaller sizes, a set of which is meant to represent the full WSI. This approach lead to a set representation for a WSI. However, unlike traditional image or sequence models, deep networks that process and learn permutation invariant representations from sets is still a developing area of research. Recent attempts at this include Multi-instance Learning Schemes, Deep Set, and Set Transformers. A particularly successful attempt in developing a deep neural network for set representation in called PointNet which was developed for classification and segmentation of 3D objects and point clouds. In PointNet, each set is represented using a set of (x,y,z) coordinates, and the network is designed to learn a permutation invariant global representation for each set and then use this representation for classification or segmentation.<br />
<br />
In this project, we attempt to first extend the PointNet network to a convolutional PointNet network such that it uses a set of image patches rather than (x,y,z) coordinates to learn the universal permutation invariant representation. Then, we attempt improve the representational power of PointNet as a permutation invariant neural network. For the first part, the main challenge is that while PointNet has been designed for processing of sets with the same size, in WSIs, the size of the image and therefore number of patches is not fixed. For this reason, we will need to develop an idea which enables CNN-PointNet to process sets with different sizes. One possible solution is to use fake members to standardize the set size and then remove the effect of these fake members in backpropagation using a masking scheme. For the second part, the PointNet network can be improved in many ways. For example, the rotation matrix used is not a real rotation matrix as the orthogonality is incorporated using a regularization term. However, using a projected gradient technique and the existence of a closed form solution for obtaining nearest orthogonal matrix to a given matrix (Orthogonal Procrustes Problem) we can keep the exact orthogonality constraint and obtain a real rotation matrix. This exact orthogonality is geometrically important as, otherwise, this transformation will likely corrupt the neighborhood structure of the points in each set. Furthermore, PointNet uses very simple symmetric function (max pooling) as a set approximator, however there more powerful symmetric functions like statistical moments, power-sum with a trainable parameter, and other set approximators can be used. It would be interesting to see how more complicated symmetric functions can improve the representational power of PointNet to achieve more discriminative permutation invariant representations for each set (in this case WSIs).<br />
<br />
Project # 13 Group Members:<br />
<br />
Syed Saad Naseem ssnaseem@uwaterloo.ca<br />
<br />
Title: Text classification of topics related to COVID-19 on social media using deep learning<br />
The COVID-19 pandemic has become a public health emergency and a critical socioeconomic issue worldwide. It is changing the way we live and do business. Social media is a rich source of data about public opinion on different types of topics including topics about COVID-19. I plan on using Reddit to get a dataset of posts and comments from users related to COVID-19 and since Reddit is divided into communities so the posts and comments are also clustered by the topic of the community, for example, posts from the political subreddit will have posts about politics.<br />
<br />
I plan to make a classifier that will take a given text and will tell what the text of talking about for example it can be talking about politics, studies, relationships, etc. The goals of this project are to:<br />
<br />
• Scrape a dataset from Reddit from different communities<br />
<br />
• Train a deep learning model (CNN or RNN model) to classify a given text into the possible categories<br />
<br />
• Test the model on posts from social talking about COVID-19<br />
<br />
<br />
<br />
Project # 14 Group members<br />
<br />
Edwards, John<br />
<br />
Title: Click-through Rate Prediction Using Historical User Data<br />
<br />
Click-through Rate (CTR) prediction consists of forecasting a users probability of clicking on a specified target. CTR is used largely by online advertising systems which sell ad space on a cost-per-click pricing model to asses the likenesses of a user clicking on a targeted ad. <br />
<br />
User session logs provides firms with an assortment of individual specific features, a large - number of which are categorical. Additionally, advertisers posses multiple ad candidates each with their own respective features. The challenge of CTR prediction is to design a model which encompass the Interacting effects of these features to produced high quality forecasts and pair users with advertisements with high potential for click conversion. Additionally computational efficiency must balanced with model complexity so that predictions can be done in an online setting throughout the progression of a users session.<br />
<br />
This projects primary objective will be to attempt creating a new Deep Neural Network (DNN) architecture for producing high quality CTR forecasts while also satisfying the aforementioned challenges.<br />
<br />
While many variants of DNN for CTR predictions exists they can differ greatly in application setting. Specifically, the vast majority of models evaluate each user-ad interaction independently. They fail to utlise information contained for each specific users’ historical add impressions. There is only a small subset of models [1,2,4] which have tried to address this by adapting architectures to utilize historical information. This projects focus will be within this application setting exploring new architectures which can better utilise information contained within a users historical behaviour. <br />
<br />
This projects implementation will consist of the following action plan:<br />
Develop a new model architecture inspired by innovations of previous CTR network designs which lacked the ability to adapt their model to utlize a users historical data [4,5].<br />
Use the public benchmark Avito advertising dataset to empirically evaluate the new models performance and compare it against previous state of the art models for this data set. <br />
<br />
References:<br />
<br />
[1] Ouyang, Wentao & Zhang, Xiuwu & Ren, Shukui & Li, Li & Liu, Zhaojie & Du, Yanlong. (2019). Click-Through Rate Prediction with the User Memory Network. <br />
<br />
[2] Ouyang, Wentao & Zhang, Xiuwu & Li, Li & Zou, Heng & Xing, Xin & Liu, Zhaojie & Du, Yanlong. (2019). Deep Spatio-Temporal Neural Networks for Click-Through Rate Prediction. 2078-2086. 10.1145/3292500.3330655. <br />
<br />
[3] Ouyang, Wentao & Zhang, Xiuwu & Ren, Shukui & Qi, Chao & Liu, Zhaojie & Du, Yanlong. (2019). Representation Learning-Assisted Click-Through Rate Prediction. 4561-4567. 10.24963/ijcai.2019/634. <br />
<br />
[4] Li, Zeyu, Wei Cheng, Yang Chen, H. Chen and W. Wang. “Interpretable Click-Through Rate Prediction through Hierarchical Attention.” Proceedings of the 13th International Conference on Web Search and Data Mining (2020)<br />
<br />
[5] Zhou, Guorui & Gai, Kun & Zhu, Xiaoqiang & Song, Chenru & Fan, Ying & Zhu, Han & Ma, Xiao & Yan, Yanghui & Jin, Junqi & Li, Han. (2018). Deep Interest Network for Click-Through Rate Prediction. 1059-1068. 10.1145/3219819.3219823.</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=48501Roberta2020-11-30T19:41:24Z<p>Msikarou: /* Conclusion */</p>
<hr />
<div>= RoBERTa: A Robustly Optimized BERT Pretraining Approach =<br />
== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of datasets. These 2 modification categories help them to improve performance on downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction (NSP) as their objectives. The MLM objective randomly selects some of the tokens in the input sequence and replaces them with the special token [MASK]. Then they try to predict these tokens based on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. Selecting positives next sentences is trivial however negative ones are often much more difficult, though originally they used a random next sentence as a negative. They use the Adam optimizer with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD, RACE and showed their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modeling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5, and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. However, the authors chose to use FULL-SENTENCES since unlike the DOC-SENTENCES consistent batch sizes can be used.<br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE. BPE is hybrid between character and word level modelling based on sub word units. The authors also used a vocabulary size of 50k rather than 30k at the original BERT implementation did, this increases the total parameters by approximately 15M to 20M for the BERT based and BERT large respectively. This change actually results in slight degradation of end-task performance in some cases, however, the authors preferred having the ability to universally encode text without introducing the "unknown" token.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all the aforementioned modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
In conjunction with the modifications of dynamic masking, full-sentences whiteout an NLP loss, large mini-batches, and a larger byte-level BPE.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
RoBERTa has outperformed state of the art in almost all GLUE tasks, including ensemble models. More than that, they compare the performance of the RoBERTa with other methods on the RACE and SQuAD evaluation and show their results in the bellow table.<br />
[[File:squad.JPG|400px|center]]<br />
[[File:race.JPG|400px|center]]<br />
<br />
== Conclusion ==<br />
In conclusion, they basically said the reasons why they make gains may be questionable, and if you decently pre-train BERT, you will achieve the same performances as RoBERTa.<br />
<br />
== The comparison at a glance ==<br />
<br />
[[File:comparison_roberta.png|500px|center|thumb|from [8]]]<br />
<br />
== Critique ==<br />
While the results are outstanding and appreciable (reasonably due to using more data and resources), the technical novelty contribution of the paper is marginally incremental as the architecture is largely unchanged from BERT.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/pytorch/fairseq/tree/master/examples/roberta RoBERTa]. The original repository for roBERTa is in PyTorch. In case you are a TensorFlow user, you would be interested in [https://github.com/Yaozeng/roberta-tf PyTorch to TensorFlow] repository as well.<br />
<br />
==Refrences == <br />
[1] Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.In North American Association for Computational Linguistics (NAACL).<br />
<br />
[2] Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.<br />
<br />
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).<br />
<br />
[4] Guillaume Lample and Alexis Conneau. 2019. Cross lingual language model pretraining. arXiv preprint arXiv:1901.07291.<br />
<br />
[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.<br />
<br />
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.<br />
<br />
[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).<br />
<br />
[8] BERT, RoBERTa, DistilBERT, XLNet - which one to use?<br />
Suleiman Khan<br />
[https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8/ link]</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=48331Roberta2020-11-30T05:27:24Z<p>Msikarou: /* Source Code */</p>
<hr />
<div>= RoBERTa: A Robustly Optimized BERT Pretraining Approach =<br />
== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD, RACE and showed their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modeling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5, and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
RoBERTa has outperformed state of the art in almost all GLUE tasks, including ensemble models. More than that, they compare the performance of the RoBERTa with other methods on the RACE and SQuAD evaluation and show their results in the bellow table.<br />
[[File:squad.JPG|400px|center]]<br />
[[File:race.JPG|400px|center]]<br />
<br />
== Conclusion ==<br />
In conclusion, they basically said the reasons why they make gains may be questionable, and if you decently pre-train BERT, you will have achieved the same performances as RoBERTa.<br />
<br />
<br />
== The comparison at a glance ==<br />
<br />
[[File:comparison_roberta.png|500px|center|thumb|from [8]]]<br />
<br />
== Critique ==<br />
While the results are outstanding and appreciable (reasonably due to using more data and resources), the technical novelty contribution of the paper is marginally incremental as the architecture is largely unchanged from BERT.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/pytorch/fairseq/tree/master/examples/roberta RoBERTa]. The original repository for roBERTa is in PyTorch. In case you are a TensorFlow user, you would be interested in [https://github.com/Yaozeng/roberta-tf PyTorch to TensorFlow] repository as well.<br />
<br />
==Refrences == <br />
[1] Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.In North American Association for Computational Linguistics (NAACL).<br />
<br />
[2] Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.<br />
<br />
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).<br />
<br />
[4] Guillaume Lample and Alexis Conneau. 2019. Cross lingual language model pretraining. arXiv preprint arXiv:1901.07291.<br />
<br />
[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.<br />
<br />
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.<br />
<br />
[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).<br />
<br />
[8] BERT, RoBERTa, DistilBERT, XLNet - which one to use?<br />
Suleiman Khan<br />
[https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8/ link]</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=48328Roberta2020-11-30T05:22:27Z<p>Msikarou: /* Static vs. Dynamic Masking */</p>
<hr />
<div>= RoBERTa: A Robustly Optimized BERT Pretraining Approach =<br />
== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD, RACE and showed their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modeling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5, and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
RoBERTa has outperformed state of the art in almost all GLUE tasks, including ensemble models. More than that, they compare the performance of the RoBERTa with other methods on the RACE and SQuAD evaluation and show their results in the bellow table.<br />
[[File:squad.JPG|400px|center]]<br />
[[File:race.JPG|400px|center]]<br />
<br />
== Conclusion ==<br />
In conclusion, they basically said the reasons why they make gains may be questionable, and if you decently pre-train BERT, you will have achieved the same performances as RoBERTa.<br />
<br />
<br />
== The comparison at a glance ==<br />
<br />
[[File:comparison_roberta.png|500px|center|thumb|from [8]]]<br />
<br />
== Critique ==<br />
While the results are outstanding and appreciable (reasonably due to using more data and resources), the technical novelty contribution of the paper is marginally incremental as the architecture is largely unchanged from BERT.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/pytorch/fairseq/tree/master/examples/roberta RoBERTa].<br />
<br />
<br />
==Refrences == <br />
[1] Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.In North American Association for Computational Linguistics (NAACL).<br />
<br />
[2] Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.<br />
<br />
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).<br />
<br />
[4] Guillaume Lample and Alexis Conneau. 2019. Cross lingual language model pretraining. arXiv preprint arXiv:1901.07291.<br />
<br />
[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.<br />
<br />
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.<br />
<br />
[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).<br />
<br />
[8] BERT, RoBERTa, DistilBERT, XLNet - which one to use?<br />
Suleiman Khan<br />
[https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8/ link]</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=48327Roberta2020-11-30T05:22:06Z<p>Msikarou: /* Refrences */</p>
<hr />
<div>= RoBERTa: A Robustly Optimized BERT Pretraining Approach =<br />
== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD, RACE and showed their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5, and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
RoBERTa has outperformed state of the art in almost all GLUE tasks, including ensemble models. More than that, they compare the performance of the RoBERTa with other methods on the RACE and SQuAD evaluation and show their results in the bellow table.<br />
[[File:squad.JPG|400px|center]]<br />
[[File:race.JPG|400px|center]]<br />
<br />
== Conclusion ==<br />
In conclusion, they basically said the reasons why they make gains may be questionable, and if you decently pre-train BERT, you will have achieved the same performances as RoBERTa.<br />
<br />
<br />
== The comparison at a glance ==<br />
<br />
[[File:comparison_roberta.png|500px|center|thumb|from [8]]]<br />
<br />
== Critique ==<br />
While the results are outstanding and appreciable (reasonably due to using more data and resources), the technical novelty contribution of the paper is marginally incremental as the architecture is largely unchanged from BERT.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/pytorch/fairseq/tree/master/examples/roberta RoBERTa].<br />
<br />
<br />
==Refrences == <br />
[1] Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.In North American Association for Computational Linguistics (NAACL).<br />
<br />
[2] Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.<br />
<br />
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).<br />
<br />
[4] Guillaume Lample and Alexis Conneau. 2019. Cross lingual language model pretraining. arXiv preprint arXiv:1901.07291.<br />
<br />
[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.<br />
<br />
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.<br />
<br />
[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).<br />
<br />
[8] BERT, RoBERTa, DistilBERT, XLNet - which one to use?<br />
Suleiman Khan<br />
[https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8/ link]</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=48326Roberta2020-11-30T05:17:11Z<p>Msikarou: /* The comparison at a glance */</p>
<hr />
<div>= RoBERTa: A Robustly Optimized BERT Pretraining Approach =<br />
== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD, RACE and showed their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5, and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
RoBERTa has outperformed state of the art in almost all GLUE tasks, including ensemble models. More than that, they compare the performance of the RoBERTa with other methods on the RACE and SQuAD evaluation and show their results in the bellow table.<br />
[[File:squad.JPG|400px|center]]<br />
[[File:race.JPG|400px|center]]<br />
<br />
== Conclusion ==<br />
In conclusion, they basically said the reasons why they make gains may be questionable, and if you decently pre-train BERT, you will have achieved the same performances as RoBERTa.<br />
<br />
<br />
== The comparison at a glance ==<br />
<br />
[[File:comparison_roberta.png|500px|center|thumb|from [8]]]<br />
<br />
== Critique ==<br />
While the results are outstanding and appreciable (reasonably due to using more data and resources), the technical novelty contribution of the paper is marginally incremental as the architecture is largely unchanged from BERT.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/pytorch/fairseq/tree/master/examples/roberta RoBERTa].<br />
<br />
<br />
==Refrences == <br />
[1] Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.In North American Association for Computational Linguistics (NAACL).<br />
<br />
[2] Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.<br />
<br />
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).<br />
<br />
[4] Guillaume Lample and Alexis Conneau. 2019. Cross lingual language model pretraining. arXiv preprint arXiv:1901.07291.<br />
<br />
[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.<br />
<br />
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.<br />
<br />
[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=48321Roberta2020-11-30T05:14:32Z<p>Msikarou: /* The comparison at a glance */</p>
<hr />
<div>= RoBERTa: A Robustly Optimized BERT Pretraining Approach =<br />
== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD, RACE and showed their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5, and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
RoBERTa has outperformed state of the art in almost all GLUE tasks, including ensemble models. More than that, they compare the performance of the RoBERTa with other methods on the RACE and SQuAD evaluation and show their results in the bellow table.<br />
[[File:squad.JPG|400px|center]]<br />
[[File:race.JPG|400px|center]]<br />
<br />
== Conclusion ==<br />
In conclusion, they basically said the reasons why they make gains may be questionable, and if you decently pre-train BERT, you will have achieved the same performances as RoBERTa.<br />
<br />
<br />
== The comparison at a glance ==<br />
<br />
[[File:comparison_roberta.png|600px|center| from [8] ]]<br />
<br />
== Critique ==<br />
While the results are outstanding and appreciable (reasonably due to using more data and resources), the technical novelty contribution of the paper is marginally incremental as the architecture is largely unchanged from BERT.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/pytorch/fairseq/tree/master/examples/roberta RoBERTa].<br />
<br />
<br />
==Refrences == <br />
[1] Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.In North American Association for Computational Linguistics (NAACL).<br />
<br />
[2] Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.<br />
<br />
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).<br />
<br />
[4] Guillaume Lample and Alexis Conneau. 2019. Cross lingual language model pretraining. arXiv preprint arXiv:1901.07291.<br />
<br />
[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.<br />
<br />
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.<br />
<br />
[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=48318Roberta2020-11-30T05:11:08Z<p>Msikarou: /* Conclusion */</p>
<hr />
<div>= RoBERTa: A Robustly Optimized BERT Pretraining Approach =<br />
== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but knowing which part the methods have the most contribution is challenging to determine. Roberta replications BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, what they did can be categorized by (1) they modified some BERT design choices and training schemes. (2) they used a new set of new datasets. These 2 modification categories help them to improve performance on the downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modelling (MLM) and next sentence prediction(NSP) as their objectives. The MLM objectives randomly sampled some of the tokens in the input sequence and replaced them with the special token [MASK]. Then they try to predict these tokens base on the surrounding information. NSP is a binary classification loss for the prediction of whether the two sentences follow each other or not. They use Adam optimization with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD, RACE and showed their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As I mentioned in the previous section, the masked language modelling objective in BERT pretraining masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
The next thing they tried to investigate was the necessity of the next sentence prediction objection. They tried different settings to show they would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document at random with a probability of 0.5, and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: Same as the segment-pair representation, just with pairs of sentences. However, the total length of sequences here would be a lot less than 512. Hence a larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, just that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs in the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. <br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes.<br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all these modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
RoBERTa has outperformed state of the art in almost all GLUE tasks, including ensemble models. More than that, they compare the performance of the RoBERTa with other methods on the RACE and SQuAD evaluation and show their results in the bellow table.<br />
[[File:squad.JPG|400px|center]]<br />
[[File:race.JPG|400px|center]]<br />
<br />
== Conclusion ==<br />
In conclusion, they basically said the reasons why they make gains may be questionable, and if you decently pre-train BERT, you will have achieved the same performances as RoBERTa.<br />
<br />
<br />
== The comparison at a glance ==<br />
<br />
[[File:comparison_roberta.png|600px|center]]<br />
<br />
== Critique ==<br />
While the results are outstanding and appreciable (reasonably due to using more data and resources), the technical novelty contribution of the paper is marginally incremental as the architecture is largely unchanged from BERT.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/pytorch/fairseq/tree/master/examples/roberta RoBERTa].<br />
<br />
<br />
==Refrences == <br />
[1] Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.In North American Association for Computational Linguistics (NAACL).<br />
<br />
[2] Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.<br />
<br />
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).<br />
<br />
[4] Guillaume Lample and Alexis Conneau. 2019. Cross lingual language model pretraining. arXiv preprint arXiv:1901.07291.<br />
<br />
[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.<br />
<br />
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.<br />
<br />
[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:comparison_roberta.png&diff=48317File:comparison roberta.png2020-11-30T05:08:16Z<p>Msikarou: </p>
<hr />
<div></div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning&diff=48300orthogonal gradient descent for continual learning2020-11-30T04:43:12Z<p>Msikarou: /* Critique */</p>
<hr />
<div>== Authors == <br />
Mehrdad Farajtabar, Navid Azizan, Alex Mott, Ang Li<br />
<br />
== Introduction == <br />
Neural Networks suffer from <i>catastrophic forgetting</i>: forgetting previously learned tasks when trained to do new ones. Most neural networks can’t learn tasks sequentially despite having the capacity to learn them simultaneously. For example, training a CNN to look at only one label of CIFAR10 at a time results in poor performance for the initially trained labels (catastrophic forgetting). But that same CNN will perform really well if all the labels are trained simultaneously (as is standard). The ability to learn tasks sequentially is called continual learning, and it is crucially important for real world applications of machine learning. For example, a medical imaging classifier might be able to classify a set of base diseases very well, but its utility is limited if it cannot be adapted to learn novel diseases - like local/rare/or new diseases (like Covid-19).<br />
<br />
This work introduces a new learning algorithm called Orthogonal Gradient Descent (OGD) that replaces Stochastic Gradient Descent (SGD). In standard SGD, the optimization takes no care to retain performance on any previously learned tasks, which works well when the task is presented all at once and iid. However, in a continual learning setting, when tasks/labels are presented sequentially, SGD fails to retain performance on earlier tasks. This is because when data is presented simultaneously, our goal is to model the underlying joint data distribution <math>P(X_1,X_2,\ldots, X_n)</math>, and we can sample batches like <math>(X_1,X_2,\ldots, X_m)</math> iid from this distribution, which is assumed to be "fixed" during training. In continual learning, this distribution typically shifts over time, thus resulting in the failure of SGD. OGD considers previously learned tasks by maintaining a space of previous gradients, such that incoming gradients can be projected onto an orthogonal basis of that space - minimally impacting previously attained performance.<br />
<br />
== Previous Work == <br />
<br />
Continual learning is not a new concept in machine learning, and there are many previous research articles on the subject that can help to get acquainted with the subject ([4], [9], [10] for example). These previous works in continual learning can be summarized into three broad categories. There are expansion based techniques, which add neurons/modules to an existing model to accommodate incoming tasks while leveraging previously learned representations. One of the downsides of this method is the growing size of the model with an increasing number of tasks. There are also regularization based methods, which constraints weight updates according to some important measure for previous tasks. Finally, there are the repetition based methods. These models attempt to artificially interlace data from previous tasks into the training scheme of incoming tasks, mimicking traditional simultaneous learning. This can be done by using memory modules or generative networks.<br />
<br />
== Orthogonal Gradient Descent == <br />
The key insight to OGD is leveraging the overparameterization of neural networks, meaning they have more parameters than data points. In order to learn new things without forgetting old ones, OGD proposes the intuitive notion of projecting newly found gradients onto an orthogonal basis for the space of previously optimal gradients. Such an orthogonal basis will exist because neural networks are typically overparameterized. Note that moving along the gradient direction results in the biggest change for parameter update, whereas moving orthogonal to the gradient results in the least change, which effectively prevents the predictions of the previous task from changing too much. A <i>small</i> step orthogonal to the gradient of a task should result in little change to the loss for that task, owing again to the overparameterization of the network [5, 6, 7, 8]. <br />
<br />
More specifically, OGD keeps track of the gradient with respect to each logit (OGD-ALL), since the idea is to project new gradients onto a space which minimally impacts the previous task across all logits. However, they have also done experiments where they only keep track of the gradient with respect to the ground truth logit (ODG-GTL) and with the logits averaged (OGD-AVE). OGD-ALL keeps track of gradients of dimension N*C where N is the size of the previous task and C is the number of classes. OGD-AVE and OGD-GTL only store gradients of dimension N since the class logits are either averaged or ignored respectively. To further manage memory, the authors sample from all the gradients of the old task, and they find that 200 is sufficient - with diminishing returns when using more.<br />
<br />
The orthogonal basis for the span of previously attained gradients can be obtained using a simple Gram-Schmidt (or more numerically stable equivalent) iterative method. One such algorithm which can be utilized to improve numerical stability is the modified Gram-Schmidt Orthogonalisation. The issue with the simpler Gram-Schmidt algorithm can be seen in the following:<br />
<br />
Suppose we have a matrix <math>A</math> which is to be decomposed into <math>A=\hat{Q}\hat{R}</math> using the Gram-Schmidt algorithm. During the algorithm, columns of <math>\hat{Q}</math> are solved sequentially, where <math>\hat{\vec{q_j}}</math> is the <math>j^{th}</math> column of <math>\hat{Q}</math>, and <math>\hat{r_{ij}}</math> which is the <math>i^{th}</math> row and <math>j^{th}</math> column of <math>\hat{R}</math> are solved from left to right and top to bottom for only the elements <math>\hat{R}</math> to result in a upper triangular matrix. Consider when we are calculating the third column of <math>\hat{Q}</math> as follows: <math>\hat{\vec{q_{3}}}=\vec{a_3} - (\hat{\vec{q_1}}\vec{a_3})\hat{\vec{q_1}} - (\hat{\vec{q_2}}\vec{a_3})\hat{\vec{q_2}}</math>. <math> \vec{z_3}=\vec{a_3} - (\hat{\vec{q_1}}\vec{a_3})\hat{\vec{q_1}} </math> should not have a component in direction <math> \hat{\vec{q_1}}</math>, however, due to numerical stability and catastrophic cancellation [11] this is not always true. The partial result <math>\vec{z_3}</math> ends up having a component in this direction, this leads to a loss in orthogonality in the columns of <math>\hat{Q}</math>. To remedy this problem, the modified Gram-Schmidt algorithm replaces <math>\vec{a_3}</math> with <math>\vec{z_3}</math> in <math>(\hat{\vec{q_2}}\vec{a_3})\hat{\vec{q_2}}</math>, this helps in ensuring the orthogonality of the columns of <math>\hat{Q}</math> to any loss of numerical significance since we will be orthogonalizing with the vector which already has the loss of significance.<br />
<br />
<br />
<br />
<br />
Algorithm 1 shows the precise algorithm for OGD.<br />
<br />
[[File:C--Users-p2torabi-Desktop-OGD.png]]<br />
<br />
And perhaps the easiest way to understand this is pictorially. Here, Task A is the previously learned task and task B is the incoming task. The neural network <math>f</math> has parameters <math>w</math> and is indexed by the <math>j</math>th logit.<br />
<br />
[[File:Pictoral_OGD.PNG|500px]]<br />
<br />
== Results ==<br />
Each task was trained for 5 epochs, with tasks derived from the MNIST dataset. The network is a three-layer MLP with 100 hidden units in two layers and 10 logit outputs. The results of OGD-AVE, ODG-GTL, OGD-ALL are compared to SGD, ECW [2], (a regularization method using Fischer information for importance weights), A-GEM [3] (a state-of-the-art replay technique), and MTL (a ground truth "cheat" model which has access to all data throughout training). The experiments were performed for the following three continual learning benchmarks: permuted MNIST, rotated MNIST, and split MNIST. <br />
<br />
In permuted MNIST [1], there are five tasks, where each task is a fixed permutation that gets applied to each MNIST digit. The following tables show classification performance for each task after sequentially training on all the tasks. Thus, if solved catastrophic forgetting has been solved, the accuracies should be constant across tasks. If not, then there should be a significant decrease from task 5 through to task 1.<br />
<br />
[[File:PMNIST.PNG]]<br />
<br />
Rotated MNIST is similar except instead of fixed permutation there are fixed rotations. There are five sequential tasks, with MNIST images rotated at 0, 10, 20, 30, and 40 degrees in each task. <br />
<br />
[[File:RMNIST.PNG]]<br />
<br />
Split MNIST defines 5 tasks with mutually disjoint labels [4]. <br />
<br />
[[File:SMNIST.PNG]]<br />
<br />
Also, the below table corresponds to the performance of Rotated MNIST and Permuted MNIST as a function of number of gradients stored.<br />
<br />
[[File:ogd.png]]<br />
<br />
Overall OGD performs much better than ECW, A-GEM, and SGD. The primary metric to look for is decreasing performance in the earlier tasks. As we can see, MTL, which represents the ideal simultaneous learning scenario shows no drop-off across tasks since all the data from previous tasks is available when training incoming tasks. For OGD, we see a decrease, but it is not nearly as severe a decrease as naively doing SGD. OGD performs much better than ECW and slightly better than A-GEM.<br />
<br />
== Review ==<br />
This work presents an interesting and intuitive algorithm for continual learning. It is theoretically well-founded and shows higher performance than competing algorithms. One of the downsides is that the learning rate must be kept very small, in order to respect the assumption that orthogonal gradients do not affect the loss. Furthermore, this algorithm requires maintaining a set of gradients which grows with the number of tasks. The authors mention several directions for future studies based on this technique. Finding a way to store more gradients or preauthorize the important directions can result in improved results. Secondly, all the proposed methods including this method fail when the tasks are dissimilar. Finding ways to maintain performance under task dissimilarity can be an interesting research direction. Thirdly, solving for learning rate sensitivity will make this method more appealing when a large learning rate is desired. Finally, another interesting future work is extending the current method to other types of optimizers such as Adam and Adagrad or even second or even quasi-Newton methods.<br />
<br />
One interesting way for increasing the learning rate can be considering the gradient magnitude of the parameters for data of the former task. If for some specific parameters, the gradient magnitude for data of task A is low then intuitively it means they have not captured a high amount of information from task A. Having this in mind, at least we can increase the learning rate for updating these weights so that we can use them for task B.<br />
<br />
A valuable resource for continual learning is the following GitHub page: [https://github.com/optimass/continual_learning_papers/blob/master/README.md#hybrid-methods link continual_learning_papers]<br />
<br />
== Critique == <br />
The authors proposed an interesting idea for mitigating catastrophic forgetting likely to happen in the online learning setting. Although Orthogonal Gradient Descent achieves state-of-the-art results in practice for continual learning, they have not provided a theoretical guarantee. [12] have derived the first generalization guarantees for the algorithm OGD for continual learning, for overparameterized neural networks. [12] also showed that OGD is only robust to catastrophic forgetting across a single task while for the arbitrary number of tasks they have proposed OGD+.<br />
<br />
== References ==<br />
[1] Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211<br />
<br />
[2] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526.<br />
<br />
[3] Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. (2018). Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420.<br />
<br />
[4] Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3987–3995. JMLR<br />
<br />
[5] Azizan, N. and Hassibi, B. (2018). Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. arXiv preprint arXiv:1806.00952<br />
<br />
[6] Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166.<br />
<br />
[7] Allen-Zhu, Z., Li, Y., and Song, Z. (2018). A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962.<br />
<br />
[8] Azizan, N., Lale, S., and Hassibi, B. (2019). Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization. arXiv preprint arXiv:1906.03830.<br />
<br />
[9] Nagy, D. G., & Orban, G. (2017). Episodic memory for continual model learning. ArXiv, Nips.<br />
<br />
[10] Nguyen, C. V., Li, Y., Bui, T. D., & Turner, R. E. (2017). Variational continual learning. ArXiv, Vi, 1–18.<br />
<br />
[11] Wikipedia: https://en.wikipedia.org/wiki/Loss_of_significance<br />
<br />
[12] Bennani, Mehdi Abbana, and Masashi Sugiyama. "Generalisation guarantees for continual learning with orthogonal gradient descent." arXiv preprint arXiv:2006.11942 (2020).</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning&diff=48299orthogonal gradient descent for continual learning2020-11-30T04:42:59Z<p>Msikarou: /* References */</p>
<hr />
<div>== Authors == <br />
Mehrdad Farajtabar, Navid Azizan, Alex Mott, Ang Li<br />
<br />
== Introduction == <br />
Neural Networks suffer from <i>catastrophic forgetting</i>: forgetting previously learned tasks when trained to do new ones. Most neural networks can’t learn tasks sequentially despite having the capacity to learn them simultaneously. For example, training a CNN to look at only one label of CIFAR10 at a time results in poor performance for the initially trained labels (catastrophic forgetting). But that same CNN will perform really well if all the labels are trained simultaneously (as is standard). The ability to learn tasks sequentially is called continual learning, and it is crucially important for real world applications of machine learning. For example, a medical imaging classifier might be able to classify a set of base diseases very well, but its utility is limited if it cannot be adapted to learn novel diseases - like local/rare/or new diseases (like Covid-19).<br />
<br />
This work introduces a new learning algorithm called Orthogonal Gradient Descent (OGD) that replaces Stochastic Gradient Descent (SGD). In standard SGD, the optimization takes no care to retain performance on any previously learned tasks, which works well when the task is presented all at once and iid. However, in a continual learning setting, when tasks/labels are presented sequentially, SGD fails to retain performance on earlier tasks. This is because when data is presented simultaneously, our goal is to model the underlying joint data distribution <math>P(X_1,X_2,\ldots, X_n)</math>, and we can sample batches like <math>(X_1,X_2,\ldots, X_m)</math> iid from this distribution, which is assumed to be "fixed" during training. In continual learning, this distribution typically shifts over time, thus resulting in the failure of SGD. OGD considers previously learned tasks by maintaining a space of previous gradients, such that incoming gradients can be projected onto an orthogonal basis of that space - minimally impacting previously attained performance.<br />
<br />
== Previous Work == <br />
<br />
Continual learning is not a new concept in machine learning, and there are many previous research articles on the subject that can help to get acquainted with the subject ([4], [9], [10] for example). These previous works in continual learning can be summarized into three broad categories. There are expansion based techniques, which add neurons/modules to an existing model to accommodate incoming tasks while leveraging previously learned representations. One of the downsides of this method is the growing size of the model with an increasing number of tasks. There are also regularization based methods, which constraints weight updates according to some important measure for previous tasks. Finally, there are the repetition based methods. These models attempt to artificially interlace data from previous tasks into the training scheme of incoming tasks, mimicking traditional simultaneous learning. This can be done by using memory modules or generative networks.<br />
<br />
== Orthogonal Gradient Descent == <br />
The key insight to OGD is leveraging the overparameterization of neural networks, meaning they have more parameters than data points. In order to learn new things without forgetting old ones, OGD proposes the intuitive notion of projecting newly found gradients onto an orthogonal basis for the space of previously optimal gradients. Such an orthogonal basis will exist because neural networks are typically overparameterized. Note that moving along the gradient direction results in the biggest change for parameter update, whereas moving orthogonal to the gradient results in the least change, which effectively prevents the predictions of the previous task from changing too much. A <i>small</i> step orthogonal to the gradient of a task should result in little change to the loss for that task, owing again to the overparameterization of the network [5, 6, 7, 8]. <br />
<br />
More specifically, OGD keeps track of the gradient with respect to each logit (OGD-ALL), since the idea is to project new gradients onto a space which minimally impacts the previous task across all logits. However, they have also done experiments where they only keep track of the gradient with respect to the ground truth logit (ODG-GTL) and with the logits averaged (OGD-AVE). OGD-ALL keeps track of gradients of dimension N*C where N is the size of the previous task and C is the number of classes. OGD-AVE and OGD-GTL only store gradients of dimension N since the class logits are either averaged or ignored respectively. To further manage memory, the authors sample from all the gradients of the old task, and they find that 200 is sufficient - with diminishing returns when using more.<br />
<br />
The orthogonal basis for the span of previously attained gradients can be obtained using a simple Gram-Schmidt (or more numerically stable equivalent) iterative method. One such algorithm which can be utilized to improve numerical stability is the modified Gram-Schmidt Orthogonalisation. The issue with the simpler Gram-Schmidt algorithm can be seen in the following:<br />
<br />
Suppose we have a matrix <math>A</math> which is to be decomposed into <math>A=\hat{Q}\hat{R}</math> using the Gram-Schmidt algorithm. During the algorithm, columns of <math>\hat{Q}</math> are solved sequentially, where <math>\hat{\vec{q_j}}</math> is the <math>j^{th}</math> column of <math>\hat{Q}</math>, and <math>\hat{r_{ij}}</math> which is the <math>i^{th}</math> row and <math>j^{th}</math> column of <math>\hat{R}</math> are solved from left to right and top to bottom for only the elements <math>\hat{R}</math> to result in a upper triangular matrix. Consider when we are calculating the third column of <math>\hat{Q}</math> as follows: <math>\hat{\vec{q_{3}}}=\vec{a_3} - (\hat{\vec{q_1}}\vec{a_3})\hat{\vec{q_1}} - (\hat{\vec{q_2}}\vec{a_3})\hat{\vec{q_2}}</math>. <math> \vec{z_3}=\vec{a_3} - (\hat{\vec{q_1}}\vec{a_3})\hat{\vec{q_1}} </math> should not have a component in direction <math> \hat{\vec{q_1}}</math>, however, due to numerical stability and catastrophic cancellation [11] this is not always true. The partial result <math>\vec{z_3}</math> ends up having a component in this direction, this leads to a loss in orthogonality in the columns of <math>\hat{Q}</math>. To remedy this problem, the modified Gram-Schmidt algorithm replaces <math>\vec{a_3}</math> with <math>\vec{z_3}</math> in <math>(\hat{\vec{q_2}}\vec{a_3})\hat{\vec{q_2}}</math>, this helps in ensuring the orthogonality of the columns of <math>\hat{Q}</math> to any loss of numerical significance since we will be orthogonalizing with the vector which already has the loss of significance.<br />
<br />
<br />
<br />
<br />
Algorithm 1 shows the precise algorithm for OGD.<br />
<br />
[[File:C--Users-p2torabi-Desktop-OGD.png]]<br />
<br />
And perhaps the easiest way to understand this is pictorially. Here, Task A is the previously learned task and task B is the incoming task. The neural network <math>f</math> has parameters <math>w</math> and is indexed by the <math>j</math>th logit.<br />
<br />
[[File:Pictoral_OGD.PNG|500px]]<br />
<br />
== Results ==<br />
Each task was trained for 5 epochs, with tasks derived from the MNIST dataset. The network is a three-layer MLP with 100 hidden units in two layers and 10 logit outputs. The results of OGD-AVE, ODG-GTL, OGD-ALL are compared to SGD, ECW [2], (a regularization method using Fischer information for importance weights), A-GEM [3] (a state-of-the-art replay technique), and MTL (a ground truth "cheat" model which has access to all data throughout training). The experiments were performed for the following three continual learning benchmarks: permuted MNIST, rotated MNIST, and split MNIST. <br />
<br />
In permuted MNIST [1], there are five tasks, where each task is a fixed permutation that gets applied to each MNIST digit. The following tables show classification performance for each task after sequentially training on all the tasks. Thus, if solved catastrophic forgetting has been solved, the accuracies should be constant across tasks. If not, then there should be a significant decrease from task 5 through to task 1.<br />
<br />
[[File:PMNIST.PNG]]<br />
<br />
Rotated MNIST is similar except instead of fixed permutation there are fixed rotations. There are five sequential tasks, with MNIST images rotated at 0, 10, 20, 30, and 40 degrees in each task. <br />
<br />
[[File:RMNIST.PNG]]<br />
<br />
Split MNIST defines 5 tasks with mutually disjoint labels [4]. <br />
<br />
[[File:SMNIST.PNG]]<br />
<br />
Also, the below table corresponds to the performance of Rotated MNIST and Permuted MNIST as a function of number of gradients stored.<br />
<br />
[[File:ogd.png]]<br />
<br />
Overall OGD performs much better than ECW, A-GEM, and SGD. The primary metric to look for is decreasing performance in the earlier tasks. As we can see, MTL, which represents the ideal simultaneous learning scenario shows no drop-off across tasks since all the data from previous tasks is available when training incoming tasks. For OGD, we see a decrease, but it is not nearly as severe a decrease as naively doing SGD. OGD performs much better than ECW and slightly better than A-GEM.<br />
<br />
== Review ==<br />
This work presents an interesting and intuitive algorithm for continual learning. It is theoretically well-founded and shows higher performance than competing algorithms. One of the downsides is that the learning rate must be kept very small, in order to respect the assumption that orthogonal gradients do not affect the loss. Furthermore, this algorithm requires maintaining a set of gradients which grows with the number of tasks. The authors mention several directions for future studies based on this technique. Finding a way to store more gradients or preauthorize the important directions can result in improved results. Secondly, all the proposed methods including this method fail when the tasks are dissimilar. Finding ways to maintain performance under task dissimilarity can be an interesting research direction. Thirdly, solving for learning rate sensitivity will make this method more appealing when a large learning rate is desired. Finally, another interesting future work is extending the current method to other types of optimizers such as Adam and Adagrad or even second or even quasi-Newton methods.<br />
<br />
One interesting way for increasing the learning rate can be considering the gradient magnitude of the parameters for data of the former task. If for some specific parameters, the gradient magnitude for data of task A is low then intuitively it means they have not captured a high amount of information from task A. Having this in mind, at least we can increase the learning rate for updating these weights so that we can use them for task B.<br />
<br />
A valuable resource for continual learning is the following GitHub page: [https://github.com/optimass/continual_learning_papers/blob/master/README.md#hybrid-methods link continual_learning_papers]<br />
<br />
== Critique == <br />
The authors proposed an interesting idea for mitigating catastrophic forgetting likely to happen in the online learning setting. Although Orthogonal Gradient Descent achieves state-of-the-art results in practice for continual learning, they have not provided a theoretical guarantee. [] have derived the first generalization guarantees for the algorithm OGD for continual learning, for overparameterized neural networks. [] also showed that OGD is only robust to catastrophic forgetting across a single task while for the arbitrary number of tasks they have proposed OGD+.<br />
<br />
== References ==<br />
[1] Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211<br />
<br />
[2] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526.<br />
<br />
[3] Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. (2018). Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420.<br />
<br />
[4] Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3987–3995. JMLR<br />
<br />
[5] Azizan, N. and Hassibi, B. (2018). Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. arXiv preprint arXiv:1806.00952<br />
<br />
[6] Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166.<br />
<br />
[7] Allen-Zhu, Z., Li, Y., and Song, Z. (2018). A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962.<br />
<br />
[8] Azizan, N., Lale, S., and Hassibi, B. (2019). Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization. arXiv preprint arXiv:1906.03830.<br />
<br />
[9] Nagy, D. G., & Orban, G. (2017). Episodic memory for continual model learning. ArXiv, Nips.<br />
<br />
[10] Nguyen, C. V., Li, Y., Bui, T. D., & Turner, R. E. (2017). Variational continual learning. ArXiv, Vi, 1–18.<br />
<br />
[11] Wikipedia: https://en.wikipedia.org/wiki/Loss_of_significance<br />
<br />
[12] Bennani, Mehdi Abbana, and Masashi Sugiyama. "Generalisation guarantees for continual learning with orthogonal gradient descent." arXiv preprint arXiv:2006.11942 (2020).</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning&diff=48297orthogonal gradient descent for continual learning2020-11-30T04:42:32Z<p>Msikarou: /* Review */</p>
<hr />
<div>== Authors == <br />
Mehrdad Farajtabar, Navid Azizan, Alex Mott, Ang Li<br />
<br />
== Introduction == <br />
Neural Networks suffer from <i>catastrophic forgetting</i>: forgetting previously learned tasks when trained to do new ones. Most neural networks can’t learn tasks sequentially despite having the capacity to learn them simultaneously. For example, training a CNN to look at only one label of CIFAR10 at a time results in poor performance for the initially trained labels (catastrophic forgetting). But that same CNN will perform really well if all the labels are trained simultaneously (as is standard). The ability to learn tasks sequentially is called continual learning, and it is crucially important for real world applications of machine learning. For example, a medical imaging classifier might be able to classify a set of base diseases very well, but its utility is limited if it cannot be adapted to learn novel diseases - like local/rare/or new diseases (like Covid-19).<br />
<br />
This work introduces a new learning algorithm called Orthogonal Gradient Descent (OGD) that replaces Stochastic Gradient Descent (SGD). In standard SGD, the optimization takes no care to retain performance on any previously learned tasks, which works well when the task is presented all at once and iid. However, in a continual learning setting, when tasks/labels are presented sequentially, SGD fails to retain performance on earlier tasks. This is because when data is presented simultaneously, our goal is to model the underlying joint data distribution <math>P(X_1,X_2,\ldots, X_n)</math>, and we can sample batches like <math>(X_1,X_2,\ldots, X_m)</math> iid from this distribution, which is assumed to be "fixed" during training. In continual learning, this distribution typically shifts over time, thus resulting in the failure of SGD. OGD considers previously learned tasks by maintaining a space of previous gradients, such that incoming gradients can be projected onto an orthogonal basis of that space - minimally impacting previously attained performance.<br />
<br />
== Previous Work == <br />
<br />
Continual learning is not a new concept in machine learning, and there are many previous research articles on the subject that can help to get acquainted with the subject ([4], [9], [10] for example). These previous works in continual learning can be summarized into three broad categories. There are expansion based techniques, which add neurons/modules to an existing model to accommodate incoming tasks while leveraging previously learned representations. One of the downsides of this method is the growing size of the model with an increasing number of tasks. There are also regularization based methods, which constraints weight updates according to some important measure for previous tasks. Finally, there are the repetition based methods. These models attempt to artificially interlace data from previous tasks into the training scheme of incoming tasks, mimicking traditional simultaneous learning. This can be done by using memory modules or generative networks.<br />
<br />
== Orthogonal Gradient Descent == <br />
The key insight to OGD is leveraging the overparameterization of neural networks, meaning they have more parameters than data points. In order to learn new things without forgetting old ones, OGD proposes the intuitive notion of projecting newly found gradients onto an orthogonal basis for the space of previously optimal gradients. Such an orthogonal basis will exist because neural networks are typically overparameterized. Note that moving along the gradient direction results in the biggest change for parameter update, whereas moving orthogonal to the gradient results in the least change, which effectively prevents the predictions of the previous task from changing too much. A <i>small</i> step orthogonal to the gradient of a task should result in little change to the loss for that task, owing again to the overparameterization of the network [5, 6, 7, 8]. <br />
<br />
More specifically, OGD keeps track of the gradient with respect to each logit (OGD-ALL), since the idea is to project new gradients onto a space which minimally impacts the previous task across all logits. However, they have also done experiments where they only keep track of the gradient with respect to the ground truth logit (ODG-GTL) and with the logits averaged (OGD-AVE). OGD-ALL keeps track of gradients of dimension N*C where N is the size of the previous task and C is the number of classes. OGD-AVE and OGD-GTL only store gradients of dimension N since the class logits are either averaged or ignored respectively. To further manage memory, the authors sample from all the gradients of the old task, and they find that 200 is sufficient - with diminishing returns when using more.<br />
<br />
The orthogonal basis for the span of previously attained gradients can be obtained using a simple Gram-Schmidt (or more numerically stable equivalent) iterative method. One such algorithm which can be utilized to improve numerical stability is the modified Gram-Schmidt Orthogonalisation. The issue with the simpler Gram-Schmidt algorithm can be seen in the following:<br />
<br />
Suppose we have a matrix <math>A</math> which is to be decomposed into <math>A=\hat{Q}\hat{R}</math> using the Gram-Schmidt algorithm. During the algorithm, columns of <math>\hat{Q}</math> are solved sequentially, where <math>\hat{\vec{q_j}}</math> is the <math>j^{th}</math> column of <math>\hat{Q}</math>, and <math>\hat{r_{ij}}</math> which is the <math>i^{th}</math> row and <math>j^{th}</math> column of <math>\hat{R}</math> are solved from left to right and top to bottom for only the elements <math>\hat{R}</math> to result in a upper triangular matrix. Consider when we are calculating the third column of <math>\hat{Q}</math> as follows: <math>\hat{\vec{q_{3}}}=\vec{a_3} - (\hat{\vec{q_1}}\vec{a_3})\hat{\vec{q_1}} - (\hat{\vec{q_2}}\vec{a_3})\hat{\vec{q_2}}</math>. <math> \vec{z_3}=\vec{a_3} - (\hat{\vec{q_1}}\vec{a_3})\hat{\vec{q_1}} </math> should not have a component in direction <math> \hat{\vec{q_1}}</math>, however, due to numerical stability and catastrophic cancellation [11] this is not always true. The partial result <math>\vec{z_3}</math> ends up having a component in this direction, this leads to a loss in orthogonality in the columns of <math>\hat{Q}</math>. To remedy this problem, the modified Gram-Schmidt algorithm replaces <math>\vec{a_3}</math> with <math>\vec{z_3}</math> in <math>(\hat{\vec{q_2}}\vec{a_3})\hat{\vec{q_2}}</math>, this helps in ensuring the orthogonality of the columns of <math>\hat{Q}</math> to any loss of numerical significance since we will be orthogonalizing with the vector which already has the loss of significance.<br />
<br />
<br />
<br />
<br />
Algorithm 1 shows the precise algorithm for OGD.<br />
<br />
[[File:C--Users-p2torabi-Desktop-OGD.png]]<br />
<br />
And perhaps the easiest way to understand this is pictorially. Here, Task A is the previously learned task and task B is the incoming task. The neural network <math>f</math> has parameters <math>w</math> and is indexed by the <math>j</math>th logit.<br />
<br />
[[File:Pictoral_OGD.PNG|500px]]<br />
<br />
== Results ==<br />
Each task was trained for 5 epochs, with tasks derived from the MNIST dataset. The network is a three-layer MLP with 100 hidden units in two layers and 10 logit outputs. The results of OGD-AVE, ODG-GTL, OGD-ALL are compared to SGD, ECW [2], (a regularization method using Fischer information for importance weights), A-GEM [3] (a state-of-the-art replay technique), and MTL (a ground truth "cheat" model which has access to all data throughout training). The experiments were performed for the following three continual learning benchmarks: permuted MNIST, rotated MNIST, and split MNIST. <br />
<br />
In permuted MNIST [1], there are five tasks, where each task is a fixed permutation that gets applied to each MNIST digit. The following tables show classification performance for each task after sequentially training on all the tasks. Thus, if solved catastrophic forgetting has been solved, the accuracies should be constant across tasks. If not, then there should be a significant decrease from task 5 through to task 1.<br />
<br />
[[File:PMNIST.PNG]]<br />
<br />
Rotated MNIST is similar except instead of fixed permutation there are fixed rotations. There are five sequential tasks, with MNIST images rotated at 0, 10, 20, 30, and 40 degrees in each task. <br />
<br />
[[File:RMNIST.PNG]]<br />
<br />
Split MNIST defines 5 tasks with mutually disjoint labels [4]. <br />
<br />
[[File:SMNIST.PNG]]<br />
<br />
Also, the below table corresponds to the performance of Rotated MNIST and Permuted MNIST as a function of number of gradients stored.<br />
<br />
[[File:ogd.png]]<br />
<br />
Overall OGD performs much better than ECW, A-GEM, and SGD. The primary metric to look for is decreasing performance in the earlier tasks. As we can see, MTL, which represents the ideal simultaneous learning scenario shows no drop-off across tasks since all the data from previous tasks is available when training incoming tasks. For OGD, we see a decrease, but it is not nearly as severe a decrease as naively doing SGD. OGD performs much better than ECW and slightly better than A-GEM.<br />
<br />
== Review ==<br />
This work presents an interesting and intuitive algorithm for continual learning. It is theoretically well-founded and shows higher performance than competing algorithms. One of the downsides is that the learning rate must be kept very small, in order to respect the assumption that orthogonal gradients do not affect the loss. Furthermore, this algorithm requires maintaining a set of gradients which grows with the number of tasks. The authors mention several directions for future studies based on this technique. Finding a way to store more gradients or preauthorize the important directions can result in improved results. Secondly, all the proposed methods including this method fail when the tasks are dissimilar. Finding ways to maintain performance under task dissimilarity can be an interesting research direction. Thirdly, solving for learning rate sensitivity will make this method more appealing when a large learning rate is desired. Finally, another interesting future work is extending the current method to other types of optimizers such as Adam and Adagrad or even second or even quasi-Newton methods.<br />
<br />
One interesting way for increasing the learning rate can be considering the gradient magnitude of the parameters for data of the former task. If for some specific parameters, the gradient magnitude for data of task A is low then intuitively it means they have not captured a high amount of information from task A. Having this in mind, at least we can increase the learning rate for updating these weights so that we can use them for task B.<br />
<br />
A valuable resource for continual learning is the following GitHub page: [https://github.com/optimass/continual_learning_papers/blob/master/README.md#hybrid-methods link continual_learning_papers]<br />
<br />
== Critique == <br />
The authors proposed an interesting idea for mitigating catastrophic forgetting likely to happen in the online learning setting. Although Orthogonal Gradient Descent achieves state-of-the-art results in practice for continual learning, they have not provided a theoretical guarantee. [] have derived the first generalization guarantees for the algorithm OGD for continual learning, for overparameterized neural networks. [] also showed that OGD is only robust to catastrophic forgetting across a single task while for the arbitrary number of tasks they have proposed OGD+.<br />
<br />
== References ==<br />
[1] Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211<br />
<br />
[2] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526.<br />
<br />
[3] Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. (2018). Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420.<br />
<br />
[4] Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3987–3995. JMLR<br />
<br />
[5] Azizan, N. and Hassibi, B. (2018). Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. arXiv preprint arXiv:1806.00952<br />
<br />
[6] Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166.<br />
<br />
[7] Allen-Zhu, Z., Li, Y., and Song, Z. (2018). A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962.<br />
<br />
[8] Azizan, N., Lale, S., and Hassibi, B. (2019). Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization. arXiv preprint arXiv:1906.03830.<br />
<br />
[9] Nagy, D. G., & Orban, G. (2017). Episodic memory for continual model learning. ArXiv, Nips.<br />
<br />
[10] Nguyen, C. V., Li, Y., Bui, T. D., & Turner, R. E. (2017). Variational continual learning. ArXiv, Vi, 1–18.<br />
<br />
[11] Wikipedia: https://en.wikipedia.org/wiki/Loss_of_significance</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration&diff=48149The Curious Case of Degeneration2020-11-30T02:12:09Z<p>Msikarou: /* Tutorial on beam search */</p>
<hr />
<div>== Presented by == <br />
Donya Hamzeian<br />
== Introduction == <br />
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results. <br />
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 1: Text generation examples</div><br />
<br />
As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements. <br />
<br />
The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts<br />
<br />
Others have questioned whether a problem with beam search is that by expanding on only the top k tokens in each step of the generation, in later steps it may miss possible sequence that would have resulted in a more probable overall phrase. The authors argue that this isn't an issue for generating natural language as it has a lower per-token probability on average and people usually optimize against saying the obvious.<br />
<br />
The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have a high variance in likelihood leading to incoherency issues. <br />
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.<br />
<br />
<br />
===The problem with a fixed k===<br />
<br />
In the figure below, it can be seen why having a fixed k in the top-k sampling decoding strategy can lead to degenerative results, more specifically, incoherent and low diversity texts. For instance, in the left figure, the distribution of the next token is flat i.e. there are many tokens with nearly equal probability to be the next token. In this case, if we choose a small k, like 5, some tokens like "meant" and "want" may not appear in the generated text which makes it less diverse. On the other hand, in the right figure, the distribution of the next token is peaked, i.e there are very few words with very high probability. In this case, if we choose k to be large, like 10, we may end up choosing tokens like "going" and "n't" which makes the generated text incoherent. Therefore, it seems that having a fixed-k may lead to degeneration<br />
<br />
<br />
[[File: fixed-k.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 2: Flat versus peaked distribution of tokens</div><br />
<br />
==Language Model Decoding==<br />
There are two types of generation tasks. <br />
<br />
1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Due to this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.<br />
<br />
2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks and the focus of this paper.<br />
<br />
The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.<br />
<br />
\begin{align}<br />
P(x_{1:m+n})=\prod_{i=1}^{m+n}P(x_i|x_1 \ldots x_{i-1})<br />
\end{align}<br />
<br />
<br />
<br />
====Nucleus Sampling====<br />
The authors propose Nucleus Sampling as a stochastic decoding method where the shape of the probability distribution determines the set of vocabulary tokens to be sampled.<br />
In this they first find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. They then normalise the subset <math>V^{(p)}</math> into a probability distribution by dividing its elements by <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. These normalized probabilities will then be used for the generation of word samples. This entire process can be viewed as a re-scaling to the original probability distribution in to a new distbition <math>P'</math>. Where: <br />
<br />
\begin{align}<br />
P'(x|x_{1:i-1}) = \begin{cases}\frac{P(x|x_{1:i-1})}{p'}, & x \in V^{(p)} \\ 0, & otherwise \end{cases}<br />
\end{align}<br />
<br />
This decoding strategy is beneficial as it can truncate possible long tails of the original probability distribution. Thus is can then help avoid the associated problem of incoherent samples for phrases generated by long-tailed distributions as previously discussed.<br />
<br />
====Top-k Sampling====<br />
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size <math>k</math>, <math>V^{(k)} </math>, which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling. <br />
<br />
Intuitively, the difference between Top-k sampling and Nucleus sampling is how they set a threshold of truncation - the former one defines a threshold at which the tail of the probability distribution gets truncated, whereas the latter puts a cap the number of tokens in the vocabulary set. It is noteworthy that thresholding the number of tokens can cause <math>p'</math> to fluctuate greatly at different time steps.<br />
<br />
====Sampling with Temperature====<br />
In this method, which was proposed in [1], the probability of tokens are calculated according to the equation below where <math>t \in (0,1)</math> is the temperature and <math>u_{1:|V|} </math> are logits. <br />
<br />
<math><br />
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}<br />
</math><br />
<br />
Recent studies have shown that lowering <math>t</math> improves the quality of the generated texts while it decreases diversity. Note that the temperature <math>t</math> controls how conservative the model is, and this analogy comes from thermodynamics, where lower temperature means lower energy states are unlikely to be encountered. Hence, the lower the temperature, the less likely the model is to sample tokens with lower probability.<br />
<br />
==Likelihood Evaluation==<br />
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.<br />
<br />
<br />
====Perplexity====<br />
<br />
This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts. <br />
<br />
[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 3: Comparison of perplexity across decoding strategies</div><br />
<br />
====What is Perplexity?====<br />
<br />
Perplexity as previously mentioned is a score that comes from information theory [3]. It is a measure of how well a probabilistic model or distribution predicts a sample. This intuitively leads to it is useful for comparing how competition models explain the same sample or dataset. Perplexity has close ties to information entropy as can be seen in the following discrete formulation of perplexity for a probability distribution.<br />
<br />
:<math>PP(p) := 2^{H(p)}=2^{-\sum_x p(x)\log_2 p(x)}</math><br />
<br />
Here <math>H(p)</math> is the entropy in bits and <math>p(x)</math> is the probability of observing <math>x</math> from the distribution.<br />
<br />
Perplexity in the context of probability models also has close ties to information entropy. The idea here is a model <math>f(x)</math> is fit to data from an unknown probability distribution <math>p(x)</math>. When the model is given test samples which were not used during its construction; the model will assign these samples some probability <math>f(x_i)</math>. Here <math>x_i</math> comes from a test set where <math>i = 1,...,N</math>. The perplexity will be lowest for a model which has high probabilities for the test samples. This can be seen in the following equation:<br />
<br />
:<math>PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)}</math><br />
<br />
Here <math>b</math> is the base and can be any number though commonly 2 is used to represent bits.<br />
<br />
==Distributional Statistical Evaluation==<br />
====Zipf Distribution Analysis====<br />
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.<br />
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 4: Zipf Distribution Analysis</div><br />
<br />
====Self BLEU====<br />
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling. <br />
<br />
[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 5: Comparison of Self-BLEU for decoding strategies</div><br />
<br />
==Conclusion==<br />
In this paper, different decoding strategies were analyzed on open-ended generation tasks. Their results show that (1) likelihood maximization decoding causes degeneration, (2) the probability distributions of the best current language models have an unreliable tail which needs to be truncated during generation, and (3) Nucleus Sampling is currently the best available decoding strategy for generating long-form text that is both high-quality — as measured by human evaluation — and as diverse as human-written text.<br />
<br />
== Critiques==<br />
The only problem that I can observe from the nucleus sampling that it is only restricted to words from p-th percentile of the distribution during generation. The probabilities of words for which the cumulative sum exceeds the percentile are rescaled and the sequence is sampled from this subset. I think a probing methodology will make it generate more varied and semantically richer responses.<br />
<br />
The comparison of perplexity scores across varying decoding strategies is interesting in the sense that we see that the Top-k method with k = 10^3 results in a score roughly similar to the Nucleus method with p = 0.95. Which method would be more computationally efficient? It seems that finding the set <math> V^{(p)}</math> is more computationally intensive compared to choosing the top-k outputs each time.<br />
<br />
== Repository ==<br />
<br />
The official repository for this paper is available at <span class="plainlinks">[https://github.com/ari-holtzman/degen "official repository"]</span><br />
<br />
== Tutorial on beam search ==<br />
1- [https://youtu.be/Er2ucMxjdHE Greedy Search]<br />
<br />
2- [https://youtu.be/RLWuzLLSIgw?list=PLCSzVeDv57Z1y0uWZXYX2kq5UUpqA0Mk2 Beam Search]<br />
<br />
== References ==<br />
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.<br />
<br />
[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018<br />
<br />
[3]: Perplexity: https://en.wikipedia.org/wiki/Perplexity</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration&diff=48147The Curious Case of Degeneration2020-11-30T02:11:03Z<p>Msikarou: /* Zipf Distribution Analysis */</p>
<hr />
<div>== Presented by == <br />
Donya Hamzeian<br />
== Introduction == <br />
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results. <br />
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 1: Text generation examples</div><br />
<br />
As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements. <br />
<br />
The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts<br />
<br />
Others have questioned whether a problem with beam search is that by expanding on only the top k tokens in each step of the generation, in later steps it may miss possible sequence that would have resulted in a more probable overall phrase. The authors argue that this isn't an issue for generating natural language as it has a lower per-token probability on average and people usually optimize against saying the obvious.<br />
<br />
The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have a high variance in likelihood leading to incoherency issues. <br />
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.<br />
<br />
<br />
===The problem with a fixed k===<br />
<br />
In the figure below, it can be seen why having a fixed k in the top-k sampling decoding strategy can lead to degenerative results, more specifically, incoherent and low diversity texts. For instance, in the left figure, the distribution of the next token is flat i.e. there are many tokens with nearly equal probability to be the next token. In this case, if we choose a small k, like 5, some tokens like "meant" and "want" may not appear in the generated text which makes it less diverse. On the other hand, in the right figure, the distribution of the next token is peaked, i.e there are very few words with very high probability. In this case, if we choose k to be large, like 10, we may end up choosing tokens like "going" and "n't" which makes the generated text incoherent. Therefore, it seems that having a fixed-k may lead to degeneration<br />
<br />
<br />
[[File: fixed-k.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 2: Flat versus peaked distribution of tokens</div><br />
<br />
==Language Model Decoding==<br />
There are two types of generation tasks. <br />
<br />
1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Due to this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.<br />
<br />
2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks and the focus of this paper.<br />
<br />
The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.<br />
<br />
\begin{align}<br />
P(x_{1:m+n})=\prod_{i=1}^{m+n}P(x_i|x_1 \ldots x_{i-1})<br />
\end{align}<br />
<br />
<br />
<br />
====Nucleus Sampling====<br />
The authors propose Nucleus Sampling as a stochastic decoding method where the shape of the probability distribution determines the set of vocabulary tokens to be sampled.<br />
In this they first find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. They then normalise the subset <math>V^{(p)}</math> into a probability distribution by dividing its elements by <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. These normalized probabilities will then be used for the generation of word samples. This entire process can be viewed as a re-scaling to the original probability distribution in to a new distbition <math>P'</math>. Where: <br />
<br />
\begin{align}<br />
P'(x|x_{1:i-1}) = \begin{cases}\frac{P(x|x_{1:i-1})}{p'}, & x \in V^{(p)} \\ 0, & otherwise \end{cases}<br />
\end{align}<br />
<br />
This decoding strategy is beneficial as it can truncate possible long tails of the original probability distribution. Thus is can then help avoid the associated problem of incoherent samples for phrases generated by long-tailed distributions as previously discussed.<br />
<br />
====Top-k Sampling====<br />
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size <math>k</math>, <math>V^{(k)} </math>, which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling. <br />
<br />
Intuitively, the difference between Top-k sampling and Nucleus sampling is how they set a threshold of truncation - the former one defines a threshold at which the tail of the probability distribution gets truncated, whereas the latter puts a cap the number of tokens in the vocabulary set. It is noteworthy that thresholding the number of tokens can cause <math>p'</math> to fluctuate greatly at different time steps.<br />
<br />
====Sampling with Temperature====<br />
In this method, which was proposed in [1], the probability of tokens are calculated according to the equation below where <math>t \in (0,1)</math> is the temperature and <math>u_{1:|V|} </math> are logits. <br />
<br />
<math><br />
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}<br />
</math><br />
<br />
Recent studies have shown that lowering <math>t</math> improves the quality of the generated texts while it decreases diversity. Note that the temperature <math>t</math> controls how conservative the model is, and this analogy comes from thermodynamics, where lower temperature means lower energy states are unlikely to be encountered. Hence, the lower the temperature, the less likely the model is to sample tokens with lower probability.<br />
<br />
==Likelihood Evaluation==<br />
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.<br />
<br />
<br />
====Perplexity====<br />
<br />
This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts. <br />
<br />
[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 3: Comparison of perplexity across decoding strategies</div><br />
<br />
====What is Perplexity?====<br />
<br />
Perplexity as previously mentioned is a score that comes from information theory [3]. It is a measure of how well a probabilistic model or distribution predicts a sample. This intuitively leads to it is useful for comparing how competition models explain the same sample or dataset. Perplexity has close ties to information entropy as can be seen in the following discrete formulation of perplexity for a probability distribution.<br />
<br />
:<math>PP(p) := 2^{H(p)}=2^{-\sum_x p(x)\log_2 p(x)}</math><br />
<br />
Here <math>H(p)</math> is the entropy in bits and <math>p(x)</math> is the probability of observing <math>x</math> from the distribution.<br />
<br />
Perplexity in the context of probability models also has close ties to information entropy. The idea here is a model <math>f(x)</math> is fit to data from an unknown probability distribution <math>p(x)</math>. When the model is given test samples which were not used during its construction; the model will assign these samples some probability <math>f(x_i)</math>. Here <math>x_i</math> comes from a test set where <math>i = 1,...,N</math>. The perplexity will be lowest for a model which has high probabilities for the test samples. This can be seen in the following equation:<br />
<br />
:<math>PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)}</math><br />
<br />
Here <math>b</math> is the base and can be any number though commonly 2 is used to represent bits.<br />
<br />
==Distributional Statistical Evaluation==<br />
====Zipf Distribution Analysis====<br />
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.<br />
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 4: Zipf Distribution Analysis</div><br />
<br />
====Self BLEU====<br />
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling. <br />
<br />
[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 5: Comparison of Self-BLEU for decoding strategies</div><br />
<br />
==Conclusion==<br />
In this paper, different decoding strategies were analyzed on open-ended generation tasks. Their results show that (1) likelihood maximization decoding causes degeneration, (2) the probability distributions of the best current language models have an unreliable tail which needs to be truncated during generation, and (3) Nucleus Sampling is currently the best available decoding strategy for generating long-form text that is both high-quality — as measured by human evaluation — and as diverse as human-written text.<br />
<br />
== Critiques==<br />
The only problem that I can observe from the nucleus sampling that it is only restricted to words from p-th percentile of the distribution during generation. The probabilities of words for which the cumulative sum exceeds the percentile are rescaled and the sequence is sampled from this subset. I think a probing methodology will make it generate more varied and semantically richer responses.<br />
<br />
The comparison of perplexity scores across varying decoding strategies is interesting in the sense that we see that the Top-k method with k = 10^3 results in a score roughly similar to the Nucleus method with p = 0.95. Which method would be more computationally efficient? It seems that finding the set <math> V^{(p)}</math> is more computationally intensive compared to choosing the top-k outputs each time.<br />
<br />
== Repository ==<br />
<br />
The official repository for this paper is available at <span class="plainlinks">[https://github.com/ari-holtzman/degen "official repository"]</span><br />
<br />
== Tutorial on beam search ==<br />
1- [https://youtu.be/Er2ucMxjdHE Greedy Search]<br />
<br />
2- [https://youtu.be/RLWuzLLSIgw?list=PLCSzVeDv57Z1y0uWZXYX2kq5UUpqA0Mk2 Beam search]<br />
<br />
== References ==<br />
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.<br />
<br />
[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018<br />
<br />
[3]: Perplexity: https://en.wikipedia.org/wiki/Perplexity</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration&diff=48145The Curious Case of Degeneration2020-11-30T02:10:51Z<p>Msikarou: /* Likelihood Evaluation */</p>
<hr />
<div>== Presented by == <br />
Donya Hamzeian<br />
== Introduction == <br />
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results. <br />
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 1: Text generation examples</div><br />
<br />
As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements. <br />
<br />
The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts<br />
<br />
Others have questioned whether a problem with beam search is that by expanding on only the top k tokens in each step of the generation, in later steps it may miss possible sequence that would have resulted in a more probable overall phrase. The authors argue that this isn't an issue for generating natural language as it has a lower per-token probability on average and people usually optimize against saying the obvious.<br />
<br />
The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have a high variance in likelihood leading to incoherency issues. <br />
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.<br />
<br />
<br />
===The problem with a fixed k===<br />
<br />
In the figure below, it can be seen why having a fixed k in the top-k sampling decoding strategy can lead to degenerative results, more specifically, incoherent and low diversity texts. For instance, in the left figure, the distribution of the next token is flat i.e. there are many tokens with nearly equal probability to be the next token. In this case, if we choose a small k, like 5, some tokens like "meant" and "want" may not appear in the generated text which makes it less diverse. On the other hand, in the right figure, the distribution of the next token is peaked, i.e there are very few words with very high probability. In this case, if we choose k to be large, like 10, we may end up choosing tokens like "going" and "n't" which makes the generated text incoherent. Therefore, it seems that having a fixed-k may lead to degeneration<br />
<br />
<br />
[[File: fixed-k.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 2: Flat versus peaked distribution of tokens</div><br />
<br />
==Language Model Decoding==<br />
There are two types of generation tasks. <br />
<br />
1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Due to this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.<br />
<br />
2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks and the focus of this paper.<br />
<br />
The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.<br />
<br />
\begin{align}<br />
P(x_{1:m+n})=\prod_{i=1}^{m+n}P(x_i|x_1 \ldots x_{i-1})<br />
\end{align}<br />
<br />
<br />
<br />
====Nucleus Sampling====<br />
The authors propose Nucleus Sampling as a stochastic decoding method where the shape of the probability distribution determines the set of vocabulary tokens to be sampled.<br />
In this they first find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. They then normalise the subset <math>V^{(p)}</math> into a probability distribution by dividing its elements by <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. These normalized probabilities will then be used for the generation of word samples. This entire process can be viewed as a re-scaling to the original probability distribution in to a new distbition <math>P'</math>. Where: <br />
<br />
\begin{align}<br />
P'(x|x_{1:i-1}) = \begin{cases}\frac{P(x|x_{1:i-1})}{p'}, & x \in V^{(p)} \\ 0, & otherwise \end{cases}<br />
\end{align}<br />
<br />
This decoding strategy is beneficial as it can truncate possible long tails of the original probability distribution. Thus is can then help avoid the associated problem of incoherent samples for phrases generated by long-tailed distributions as previously discussed.<br />
<br />
====Top-k Sampling====<br />
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size <math>k</math>, <math>V^{(k)} </math>, which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling. <br />
<br />
Intuitively, the difference between Top-k sampling and Nucleus sampling is how they set a threshold of truncation - the former one defines a threshold at which the tail of the probability distribution gets truncated, whereas the latter puts a cap the number of tokens in the vocabulary set. It is noteworthy that thresholding the number of tokens can cause <math>p'</math> to fluctuate greatly at different time steps.<br />
<br />
====Sampling with Temperature====<br />
In this method, which was proposed in [1], the probability of tokens are calculated according to the equation below where <math>t \in (0,1)</math> is the temperature and <math>u_{1:|V|} </math> are logits. <br />
<br />
<math><br />
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}<br />
</math><br />
<br />
Recent studies have shown that lowering <math>t</math> improves the quality of the generated texts while it decreases diversity. Note that the temperature <math>t</math> controls how conservative the model is, and this analogy comes from thermodynamics, where lower temperature means lower energy states are unlikely to be encountered. Hence, the lower the temperature, the less likely the model is to sample tokens with lower probability.<br />
<br />
==Likelihood Evaluation==<br />
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.<br />
<br />
<br />
====Perplexity====<br />
<br />
This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts. <br />
<br />
[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 3: Comparison of perplexity across decoding strategies</div><br />
<br />
====What is Perplexity?====<br />
<br />
Perplexity as previously mentioned is a score that comes from information theory [3]. It is a measure of how well a probabilistic model or distribution predicts a sample. This intuitively leads to it is useful for comparing how competition models explain the same sample or dataset. Perplexity has close ties to information entropy as can be seen in the following discrete formulation of perplexity for a probability distribution.<br />
<br />
:<math>PP(p) := 2^{H(p)}=2^{-\sum_x p(x)\log_2 p(x)}</math><br />
<br />
Here <math>H(p)</math> is the entropy in bits and <math>p(x)</math> is the probability of observing <math>x</math> from the distribution.<br />
<br />
Perplexity in the context of probability models also has close ties to information entropy. The idea here is a model <math>f(x)</math> is fit to data from an unknown probability distribution <math>p(x)</math>. When the model is given test samples which were not used during its construction; the model will assign these samples some probability <math>f(x_i)</math>. Here <math>x_i</math> comes from a test set where <math>i = 1,...,N</math>. The perplexity will be lowest for a model which has high probabilities for the test samples. This can be seen in the following equation:<br />
<br />
:<math>PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)}</math><br />
<br />
Here <math>b</math> is the base and can be any number though commonly 2 is used to represent bits.<br />
<br />
==Distributional Statistical Evaluation==<br />
====Zipf Distribution Analysis====<br />
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.<br />
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 4: Zipf Distribution Analysis</div><br />
<br />
<br />
====Self BLEU====<br />
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling. <br />
<br />
[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 5: Comparison of Self-BLEU for decoding strategies</div><br />
<br />
==Conclusion==<br />
In this paper, different decoding strategies were analyzed on open-ended generation tasks. Their results show that (1) likelihood maximization decoding causes degeneration, (2) the probability distributions of the best current language models have an unreliable tail which needs to be truncated during generation, and (3) Nucleus Sampling is currently the best available decoding strategy for generating long-form text that is both high-quality — as measured by human evaluation — and as diverse as human-written text.<br />
<br />
== Critiques==<br />
The only problem that I can observe from the nucleus sampling that it is only restricted to words from p-th percentile of the distribution during generation. The probabilities of words for which the cumulative sum exceeds the percentile are rescaled and the sequence is sampled from this subset. I think a probing methodology will make it generate more varied and semantically richer responses.<br />
<br />
The comparison of perplexity scores across varying decoding strategies is interesting in the sense that we see that the Top-k method with k = 10^3 results in a score roughly similar to the Nucleus method with p = 0.95. Which method would be more computationally efficient? It seems that finding the set <math> V^{(p)}</math> is more computationally intensive compared to choosing the top-k outputs each time.<br />
<br />
== Repository ==<br />
<br />
The official repository for this paper is available at <span class="plainlinks">[https://github.com/ari-holtzman/degen "official repository"]</span><br />
<br />
== Tutorial on beam search ==<br />
1- [https://youtu.be/Er2ucMxjdHE Greedy Search]<br />
<br />
2- [https://youtu.be/RLWuzLLSIgw?list=PLCSzVeDv57Z1y0uWZXYX2kq5UUpqA0Mk2 Beam search]<br />
<br />
== References ==<br />
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.<br />
<br />
[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018<br />
<br />
[3]: Perplexity: https://en.wikipedia.org/wiki/Perplexity</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration&diff=48144The Curious Case of Degeneration2020-11-30T02:10:34Z<p>Msikarou: /* Nucleus Sampling */</p>
<hr />
<div>== Presented by == <br />
Donya Hamzeian<br />
== Introduction == <br />
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results. <br />
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 1: Text generation examples</div><br />
<br />
As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements. <br />
<br />
The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts<br />
<br />
Others have questioned whether a problem with beam search is that by expanding on only the top k tokens in each step of the generation, in later steps it may miss possible sequence that would have resulted in a more probable overall phrase. The authors argue that this isn't an issue for generating natural language as it has a lower per-token probability on average and people usually optimize against saying the obvious.<br />
<br />
The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have a high variance in likelihood leading to incoherency issues. <br />
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.<br />
<br />
<br />
===The problem with a fixed k===<br />
<br />
In the figure below, it can be seen why having a fixed k in the top-k sampling decoding strategy can lead to degenerative results, more specifically, incoherent and low diversity texts. For instance, in the left figure, the distribution of the next token is flat i.e. there are many tokens with nearly equal probability to be the next token. In this case, if we choose a small k, like 5, some tokens like "meant" and "want" may not appear in the generated text which makes it less diverse. On the other hand, in the right figure, the distribution of the next token is peaked, i.e there are very few words with very high probability. In this case, if we choose k to be large, like 10, we may end up choosing tokens like "going" and "n't" which makes the generated text incoherent. Therefore, it seems that having a fixed-k may lead to degeneration<br />
<br />
<br />
[[File: fixed-k.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 2: Flat versus peaked distribution of tokens</div><br />
<br />
==Language Model Decoding==<br />
There are two types of generation tasks. <br />
<br />
1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Due to this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.<br />
<br />
2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks and the focus of this paper.<br />
<br />
The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.<br />
<br />
\begin{align}<br />
P(x_{1:m+n})=\prod_{i=1}^{m+n}P(x_i|x_1 \ldots x_{i-1})<br />
\end{align}<br />
<br />
<br />
<br />
====Nucleus Sampling====<br />
The authors propose Nucleus Sampling as a stochastic decoding method where the shape of the probability distribution determines the set of vocabulary tokens to be sampled.<br />
In this they first find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. They then normalise the subset <math>V^{(p)}</math> into a probability distribution by dividing its elements by <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. These normalized probabilities will then be used for the generation of word samples. This entire process can be viewed as a re-scaling to the original probability distribution in to a new distbition <math>P'</math>. Where: <br />
<br />
\begin{align}<br />
P'(x|x_{1:i-1}) = \begin{cases}\frac{P(x|x_{1:i-1})}{p'}, & x \in V^{(p)} \\ 0, & otherwise \end{cases}<br />
\end{align}<br />
<br />
This decoding strategy is beneficial as it can truncate possible long tails of the original probability distribution. Thus is can then help avoid the associated problem of incoherent samples for phrases generated by long-tailed distributions as previously discussed.<br />
<br />
====Top-k Sampling====<br />
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size <math>k</math>, <math>V^{(k)} </math>, which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling. <br />
<br />
Intuitively, the difference between Top-k sampling and Nucleus sampling is how they set a threshold of truncation - the former one defines a threshold at which the tail of the probability distribution gets truncated, whereas the latter puts a cap the number of tokens in the vocabulary set. It is noteworthy that thresholding the number of tokens can cause <math>p'</math> to fluctuate greatly at different time steps.<br />
<br />
====Sampling with Temperature====<br />
In this method, which was proposed in [1], the probability of tokens are calculated according to the equation below where <math>t \in (0,1)</math> is the temperature and <math>u_{1:|V|} </math> are logits. <br />
<br />
<math><br />
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}<br />
</math><br />
<br />
Recent studies have shown that lowering <math>t</math> improves the quality of the generated texts while it decreases diversity. Note that the temperature <math>t</math> controls how conservative the model is, and this analogy comes from thermodynamics, where lower temperature means lower energy states are unlikely to be encountered. Hence, the lower the temperature, the less likely the model is to sample tokens with lower probability.<br />
<br />
==Likelihood Evaluation==<br />
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.<br />
<br />
<br />
====Perplexity====<br />
<br />
This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts. <br />
<br />
[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 3: Comparison of perplexity across decoding strategies</div><br />
<br />
====What is Perplexity?====<br />
<br />
Perplexity as previously mentioned is a score that comes from information theory [3]. It is a measure of how well a probabilistic model or distribution predicts a sample. This intuitively leads to it being useful for comparing how competition models explain the same sample or dataset. Perplexity has close ties to information entropy as can be seen in the following discrete formulation of perplexity for a probability distribution.<br />
<br />
:<math>PP(p) := 2^{H(p)}=2^{-\sum_x p(x)\log_2 p(x)}</math><br />
<br />
Here <math>H(p)</math> is the entropy in bits and <math>p(x)</math> is the probability of observing <math>x</math> from the distribution.<br />
<br />
Perplexity in the context of probability models also has close ties to information entropy. The idea here is a model <math>f(x)</math> is fit to data from an unknown probability distribution <math>p(x)</math>. When the model is given test samples which were not used during its construction; the model will assign these samples some probability <math>f(x_i)</math>. Here <math>x_i</math> comes from a test set where <math>i = 1,...,N</math>. The perplexity will be lowest for a model which has high probabilities for the test samples. This can be seen in the following equation:<br />
<br />
:<math>PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)}</math><br />
<br />
Here <math>b</math> is the base and can be any number though commonly 2 is used to represent bits.<br />
<br />
==Distributional Statistical Evaluation==<br />
====Zipf Distribution Analysis====<br />
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.<br />
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 4: Zipf Distribution Analysis</div><br />
<br />
<br />
====Self BLEU====<br />
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling. <br />
<br />
[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 5: Comparison of Self-BLEU for decoding strategies</div><br />
<br />
==Conclusion==<br />
In this paper, different decoding strategies were analyzed on open-ended generation tasks. Their results show that (1) likelihood maximization decoding causes degeneration, (2) the probability distributions of the best current language models have an unreliable tail which needs to be truncated during generation, and (3) Nucleus Sampling is currently the best available decoding strategy for generating long-form text that is both high-quality — as measured by human evaluation — and as diverse as human-written text.<br />
<br />
== Critiques==<br />
The only problem that I can observe from the nucleus sampling that it is only restricted to words from p-th percentile of the distribution during generation. The probabilities of words for which the cumulative sum exceeds the percentile are rescaled and the sequence is sampled from this subset. I think a probing methodology will make it generate more varied and semantically richer responses.<br />
<br />
The comparison of perplexity scores across varying decoding strategies is interesting in the sense that we see that the Top-k method with k = 10^3 results in a score roughly similar to the Nucleus method with p = 0.95. Which method would be more computationally efficient? It seems that finding the set <math> V^{(p)}</math> is more computationally intensive compared to choosing the top-k outputs each time.<br />
<br />
== Repository ==<br />
<br />
The official repository for this paper is available at <span class="plainlinks">[https://github.com/ari-holtzman/degen "official repository"]</span><br />
<br />
== Tutorial on beam search ==<br />
1- [https://youtu.be/Er2ucMxjdHE Greedy Search]<br />
<br />
2- [https://youtu.be/RLWuzLLSIgw?list=PLCSzVeDv57Z1y0uWZXYX2kq5UUpqA0Mk2 Beam search]<br />
<br />
== References ==<br />
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.<br />
<br />
[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018<br />
<br />
[3]: Perplexity: https://en.wikipedia.org/wiki/Perplexity</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration&diff=48140The Curious Case of Degeneration2020-11-30T02:10:06Z<p>Msikarou: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Donya Hamzeian<br />
== Introduction == <br />
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results. <br />
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 1: Text generation examples</div><br />
<br />
As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements. <br />
<br />
The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts<br />
<br />
Others have questioned whether a problem with beam search is that by expanding on only the top k tokens in each step of the generation, in later steps it may miss possible sequence that would have resulted in a more probable overall phrase. The authors argue that this isn't an issue for generating natural language as it has a lower per-token probability on average and people usually optimize against saying the obvious.<br />
<br />
The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have a high variance in likelihood leading to incoherency issues. <br />
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.<br />
<br />
<br />
===The problem with a fixed k===<br />
<br />
In the figure below, it can be seen why having a fixed k in the top-k sampling decoding strategy can lead to degenerative results, more specifically, incoherent and low diversity texts. For instance, in the left figure, the distribution of the next token is flat i.e. there are many tokens with nearly equal probability to be the next token. In this case, if we choose a small k, like 5, some tokens like "meant" and "want" may not appear in the generated text which makes it less diverse. On the other hand, in the right figure, the distribution of the next token is peaked, i.e there are very few words with very high probability. In this case, if we choose k to be large, like 10, we may end up choosing tokens like "going" and "n't" which makes the generated text incoherent. Therefore, it seems that having a fixed-k may lead to degeneration<br />
<br />
<br />
[[File: fixed-k.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 2: Flat versus peaked distribution of tokens</div><br />
<br />
==Language Model Decoding==<br />
There are two types of generation tasks. <br />
<br />
1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Due to this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.<br />
<br />
2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks and the focus of this paper.<br />
<br />
The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.<br />
<br />
\begin{align}<br />
P(x_{1:m+n})=\prod_{i=1}^{m+n}P(x_i|x_1 \ldots x_{i-1})<br />
\end{align}<br />
<br />
<br />
<br />
====Nucleus Sampling====<br />
The authors propose Nucleus Sampling as a stochastic decoding method where the shape of the the probability distribution determines the set of vocabulary tokens to be sampled.<br />
In this they first find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. They then normalise the subset <math>V^{(p)}</math> into a probability distribution by dividing its elements by <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. These normalized probabilities will then be used for the generation of word samples. This entire process can be viewed as a re-scaling to the original probability distribution in to a new distbition <math>P'</math>. Where: <br />
<br />
\begin{align}<br />
P'(x|x_{1:i-1}) = \begin{cases}\frac{P(x|x_{1:i-1})}{p'}, & x \in V^{(p)} \\ 0, & otherwise \end{cases}<br />
\end{align}<br />
<br />
This decoding strategy is beneficial as it can truncate possible long tails of the original probability distribution. Thus is can then help avoid the associated problem of incoherent samples for phrases generated by long-tailed distributions as previously discussed.<br />
<br />
====Top-k Sampling====<br />
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size <math>k</math>, <math>V^{(k)} </math>, which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling. <br />
<br />
Intuitively, the difference between Top-k sampling and Nucleus sampling is how they set a threshold of truncation - the former one defines a threshold at which the tail of the probability distribution gets truncated, whereas the latter puts a cap the number of tokens in the vocabulary set. It is noteworthy that thresholding the number of tokens can cause <math>p'</math> to fluctuate greatly at different time steps.<br />
<br />
====Sampling with Temperature====<br />
In this method, which was proposed in [1], the probability of tokens are calculated according to the equation below where <math>t \in (0,1)</math> is the temperature and <math>u_{1:|V|} </math> are logits. <br />
<br />
<math><br />
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}<br />
</math><br />
<br />
Recent studies have shown that lowering <math>t</math> improves the quality of the generated texts while it decreases diversity. Note that the temperature <math>t</math> controls how conservative the model is, and this analogy comes from thermodynamics, where lower temperature means lower energy states are unlikely to be encountered. Hence, the lower the temperature, the less likely the model is to sample tokens with lower probability.<br />
<br />
==Likelihood Evaluation==<br />
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.<br />
<br />
<br />
====Perplexity====<br />
<br />
This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts. <br />
<br />
[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 3: Comparison of perplexity across decoding strategies</div><br />
<br />
====What is Perplexity?====<br />
<br />
Perplexity as previously mentioned is a score that comes from information theory [3]. It is a measure of how well a probabilistic model or distribution predicts a sample. This intuitively leads to it being useful for comparing how competition models explain the same sample or dataset. Perplexity has close ties to information entropy as can be seen in the following discrete formulation of perplexity for a probability distribution.<br />
<br />
:<math>PP(p) := 2^{H(p)}=2^{-\sum_x p(x)\log_2 p(x)}</math><br />
<br />
Here <math>H(p)</math> is the entropy in bits and <math>p(x)</math> is the probability of observing <math>x</math> from the distribution.<br />
<br />
Perplexity in the context of probability models also has close ties to information entropy. The idea here is a model <math>f(x)</math> is fit to data from an unknown probability distribution <math>p(x)</math>. When the model is given test samples which were not used during its construction; the model will assign these samples some probability <math>f(x_i)</math>. Here <math>x_i</math> comes from a test set where <math>i = 1,...,N</math>. The perplexity will be lowest for a model which has high probabilities for the test samples. This can be seen in the following equation:<br />
<br />
:<math>PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)}</math><br />
<br />
Here <math>b</math> is the base and can be any number though commonly 2 is used to represent bits.<br />
<br />
==Distributional Statistical Evaluation==<br />
====Zipf Distribution Analysis====<br />
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.<br />
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 4: Zipf Distribution Analysis</div><br />
<br />
<br />
====Self BLEU====<br />
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling. <br />
<br />
[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 5: Comparison of Self-BLEU for decoding strategies</div><br />
<br />
==Conclusion==<br />
In this paper, different decoding strategies were analyzed on open-ended generation tasks. Their results show that (1) likelihood maximization decoding causes degeneration, (2) the probability distributions of the best current language models have an unreliable tail which needs to be truncated during generation, and (3) Nucleus Sampling is currently the best available decoding strategy for generating long-form text that is both high-quality — as measured by human evaluation — and as diverse as human-written text.<br />
<br />
== Critiques==<br />
The only problem that I can observe from the nucleus sampling that it is only restricted to words from p-th percentile of the distribution during generation. The probabilities of words for which the cumulative sum exceeds the percentile are rescaled and the sequence is sampled from this subset. I think a probing methodology will make it generate more varied and semantically richer responses.<br />
<br />
The comparison of perplexity scores across varying decoding strategies is interesting in the sense that we see that the Top-k method with k = 10^3 results in a score roughly similar to the Nucleus method with p = 0.95. Which method would be more computationally efficient? It seems that finding the set <math> V^{(p)}</math> is more computationally intensive compared to choosing the top-k outputs each time.<br />
<br />
== Repository ==<br />
<br />
The official repository for this paper is available at <span class="plainlinks">[https://github.com/ari-holtzman/degen "official repository"]</span><br />
<br />
== Tutorial on beam search ==<br />
1- [https://youtu.be/Er2ucMxjdHE Greedy Search]<br />
<br />
2- [https://youtu.be/RLWuzLLSIgw?list=PLCSzVeDv57Z1y0uWZXYX2kq5UUpqA0Mk2 Beam search]<br />
<br />
== References ==<br />
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.<br />
<br />
[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018<br />
<br />
[3]: Perplexity: https://en.wikipedia.org/wiki/Perplexity</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration&diff=48137The Curious Case of Degeneration2020-11-30T02:08:34Z<p>Msikarou: /* Tutorial on beam search */</p>
<hr />
<div>== Presented by == <br />
Donya Hamzeian<br />
== Introduction == <br />
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results. <br />
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 1: Text generation examples</div><br />
<br />
As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements. <br />
<br />
The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts<br />
<br />
Others have questioned whether a problem with beam search is that by expanding on only the top k tokens in each step of the generation, in later steps it may miss possible sequence that would have resulted in a more probable overall phrase. The authors argue that this isn't an issue for generating natural language as it has lower per-token probability on average and people usually optimize against saying the obvious.<br />
<br />
The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have high variance in likelihood leading to incoherency issues. <br />
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.<br />
<br />
<br />
===The problem with a fixed k===<br />
<br />
In the figure below, it can be seen why having a fixed k in the top-k sampling decoding strategy can lead to degenerative results, more specifically, incoherent and low diversity texts. For instance, in the left figure, the distribution of the next token is flat i.e. there are many tokens with nearly equal probability to be the next token. In this case, if we choose a small k, like 5, some tokens like "meant" and "want" may not appear in the generated text which makes it less diverse. On the other hand, in the right figure, the distribution of the next token is peaked, i.e there are very few words with very high probability. In this case, if we choose k to be large, like 10, we may end up choosing tokens like "going" and "n't" which makes the generated text incoherent. Therefore, it seems that having a fixed-k may lead to degeneration<br />
<br />
<br />
[[File: fixed-k.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 2: Flat versus peaked distribution of tokens</div><br />
<br />
==Language Model Decoding==<br />
There are two types of generation tasks. <br />
<br />
1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Due to this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.<br />
<br />
2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks and the focus of this paper.<br />
<br />
The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.<br />
<br />
\begin{align}<br />
P(x_{1:m+n})=\prod_{i=1}^{m+n}P(x_i|x_1 \ldots x_{i-1})<br />
\end{align}<br />
<br />
<br />
<br />
====Nucleus Sampling====<br />
The authors propose Nucleus Sampling as a stochastic decoding method where the shape of the the probability distribution determines the set of vocabulary tokens to be sampled.<br />
In this they first find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. They then normalise the subset <math>V^{(p)}</math> into a probability distribution by dividing its elements by <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. These normalized probabilities will then be used for the generation of word samples. This entire process can be viewed as a re-scaling to the original probability distribution in to a new distbition <math>P'</math>. Where: <br />
<br />
\begin{align}<br />
P'(x|x_{1:i-1}) = \begin{cases}\frac{P(x|x_{1:i-1})}{p'}, & x \in V^{(p)} \\ 0, & otherwise \end{cases}<br />
\end{align}<br />
<br />
This decoding strategy is beneficial as it can truncate possible long tails of the original probability distribution. Thus is can then help avoid the associated problem of incoherent samples for phrases generated by long-tailed distributions as previously discussed.<br />
<br />
====Top-k Sampling====<br />
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size <math>k</math>, <math>V^{(k)} </math>, which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling. <br />
<br />
Intuitively, the difference between Top-k sampling and Nucleus sampling is how they set a threshold of truncation - the former one defines a threshold at which the tail of the probability distribution gets truncated, whereas the latter puts a cap the number of tokens in the vocabulary set. It is noteworthy that thresholding the number of tokens can cause <math>p'</math> to fluctuate greatly at different time steps.<br />
<br />
====Sampling with Temperature====<br />
In this method, which was proposed in [1], the probability of tokens are calculated according to the equation below where <math>t \in (0,1)</math> is the temperature and <math>u_{1:|V|} </math> are logits. <br />
<br />
<math><br />
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}<br />
</math><br />
<br />
Recent studies have shown that lowering <math>t</math> improves the quality of the generated texts while it decreases diversity. Note that the temperature <math>t</math> controls how conservative the model is, and this analogy comes from thermodynamics, where lower temperature means lower energy states are unlikely to be encountered. Hence, the lower the temperature, the less likely the model is to sample tokens with lower probability.<br />
<br />
==Likelihood Evaluation==<br />
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.<br />
<br />
<br />
====Perplexity====<br />
<br />
This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts. <br />
<br />
[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 3: Comparison of perplexity across decoding strategies</div><br />
<br />
====What is Perplexity?====<br />
<br />
Perplexity as previously mentioned is a score that comes from information theory [3]. It is a measure of how well a probabilistic model or distribution predicts a sample. This intuitively leads to it being useful for comparing how competition models explain the same sample or dataset. Perplexity has close ties to information entropy as can be seen in the following discrete formulation of perplexity for a probability distribution.<br />
<br />
:<math>PP(p) := 2^{H(p)}=2^{-\sum_x p(x)\log_2 p(x)}</math><br />
<br />
Here <math>H(p)</math> is the entropy in bits and <math>p(x)</math> is the probability of observing <math>x</math> from the distribution.<br />
<br />
Perplexity in the context of probability models also has close ties to information entropy. The idea here is a model <math>f(x)</math> is fit to data from an unknown probability distribution <math>p(x)</math>. When the model is given test samples which were not used during its construction; the model will assign these samples some probability <math>f(x_i)</math>. Here <math>x_i</math> comes from a test set where <math>i = 1,...,N</math>. The perplexity will be lowest for a model which has high probabilities for the test samples. This can be seen in the following equation:<br />
<br />
:<math>PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)}</math><br />
<br />
Here <math>b</math> is the base and can be any number though commonly 2 is used to represent bits.<br />
<br />
==Distributional Statistical Evaluation==<br />
====Zipf Distribution Analysis====<br />
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.<br />
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 4: Zipf Distribution Analysis</div><br />
<br />
<br />
====Self BLEU====<br />
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling. <br />
<br />
[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 5: Comparison of Self-BLEU for decoding strategies</div><br />
<br />
==Conclusion==<br />
In this paper, different decoding strategies were analyzed on open-ended generation tasks. Their results show that (1) likelihood maximization decoding causes degeneration, (2) the probability distributions of the best current language models have an unreliable tail which needs to be truncated during generation, and (3) Nucleus Sampling is currently the best available decoding strategy for generating long-form text that is both high-quality — as measured by human evaluation — and as diverse as human-written text.<br />
<br />
== Critiques==<br />
The only problem that I can observe from the nucleus sampling that it is only restricted to words from p-th percentile of the distribution during generation. The probabilities of words for which the cumulative sum exceeds the percentile are rescaled and the sequence is sampled from this subset. I think a probing methodology will make it generate more varied and semantically richer responses.<br />
<br />
The comparison of perplexity scores across varying decoding strategies is interesting in the sense that we see that the Top-k method with k = 10^3 results in a score roughly similar to the Nucleus method with p = 0.95. Which method would be more computationally efficient? It seems that finding the set <math> V^{(p)}</math> is more computationally intensive compared to choosing the top-k outputs each time.<br />
<br />
== Repository ==<br />
<br />
The official repository for this paper is available at <span class="plainlinks">[https://github.com/ari-holtzman/degen "official repository"]</span><br />
<br />
== Tutorial on beam search ==<br />
1- [https://youtu.be/Er2ucMxjdHE Greedy Search]<br />
<br />
2- [https://youtu.be/RLWuzLLSIgw?list=PLCSzVeDv57Z1y0uWZXYX2kq5UUpqA0Mk2 Beam search]<br />
<br />
== References ==<br />
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.<br />
<br />
[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018<br />
<br />
[3]: Perplexity: https://en.wikipedia.org/wiki/Perplexity</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration&diff=48133The Curious Case of Degeneration2020-11-30T02:07:07Z<p>Msikarou: /* Repository */</p>
<hr />
<div>== Presented by == <br />
Donya Hamzeian<br />
== Introduction == <br />
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results. <br />
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 1: Text generation examples</div><br />
<br />
As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements. <br />
<br />
The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts<br />
<br />
Others have questioned whether a problem with beam search is that by expanding on only the top k tokens in each step of the generation, in later steps it may miss possible sequence that would have resulted in a more probable overall phrase. The authors argue that this isn't an issue for generating natural language as it has lower per-token probability on average and people usually optimize against saying the obvious.<br />
<br />
The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have high variance in likelihood leading to incoherency issues. <br />
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.<br />
<br />
<br />
===The problem with a fixed k===<br />
<br />
In the figure below, it can be seen why having a fixed k in the top-k sampling decoding strategy can lead to degenerative results, more specifically, incoherent and low diversity texts. For instance, in the left figure, the distribution of the next token is flat i.e. there are many tokens with nearly equal probability to be the next token. In this case, if we choose a small k, like 5, some tokens like "meant" and "want" may not appear in the generated text which makes it less diverse. On the other hand, in the right figure, the distribution of the next token is peaked, i.e there are very few words with very high probability. In this case, if we choose k to be large, like 10, we may end up choosing tokens like "going" and "n't" which makes the generated text incoherent. Therefore, it seems that having a fixed-k may lead to degeneration<br />
<br />
<br />
[[File: fixed-k.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 2: Flat versus peaked distribution of tokens</div><br />
<br />
==Language Model Decoding==<br />
There are two types of generation tasks. <br />
<br />
1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Due to this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.<br />
<br />
2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks and the focus of this paper.<br />
<br />
The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.<br />
<br />
\begin{align}<br />
P(x_{1:m+n})=\prod_{i=1}^{m+n}P(x_i|x_1 \ldots x_{i-1})<br />
\end{align}<br />
<br />
<br />
<br />
====Nucleus Sampling====<br />
The authors propose Nucleus Sampling as a stochastic decoding method where the shape of the the probability distribution determines the set of vocabulary tokens to be sampled.<br />
In this they first find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. They then normalise the subset <math>V^{(p)}</math> into a probability distribution by dividing its elements by <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. These normalized probabilities will then be used for the generation of word samples. This entire process can be viewed as a re-scaling to the original probability distribution in to a new distbition <math>P'</math>. Where: <br />
<br />
\begin{align}<br />
P'(x|x_{1:i-1}) = \begin{cases}\frac{P(x|x_{1:i-1})}{p'}, & x \in V^{(p)} \\ 0, & otherwise \end{cases}<br />
\end{align}<br />
<br />
This decoding strategy is beneficial as it can truncate possible long tails of the original probability distribution. Thus is can then help avoid the associated problem of incoherent samples for phrases generated by long-tailed distributions as previously discussed.<br />
<br />
====Top-k Sampling====<br />
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size <math>k</math>, <math>V^{(k)} </math>, which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling. <br />
<br />
Intuitively, the difference between Top-k sampling and Nucleus sampling is how they set a threshold of truncation - the former one defines a threshold at which the tail of the probability distribution gets truncated, whereas the latter puts a cap the number of tokens in the vocabulary set. It is noteworthy that thresholding the number of tokens can cause <math>p'</math> to fluctuate greatly at different time steps.<br />
<br />
====Sampling with Temperature====<br />
In this method, which was proposed in [1], the probability of tokens are calculated according to the equation below where <math>t \in (0,1)</math> is the temperature and <math>u_{1:|V|} </math> are logits. <br />
<br />
<math><br />
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}<br />
</math><br />
<br />
Recent studies have shown that lowering <math>t</math> improves the quality of the generated texts while it decreases diversity. Note that the temperature <math>t</math> controls how conservative the model is, and this analogy comes from thermodynamics, where lower temperature means lower energy states are unlikely to be encountered. Hence, the lower the temperature, the less likely the model is to sample tokens with lower probability.<br />
<br />
==Likelihood Evaluation==<br />
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.<br />
<br />
<br />
====Perplexity====<br />
<br />
This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts. <br />
<br />
[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 3: Comparison of perplexity across decoding strategies</div><br />
<br />
====What is Perplexity?====<br />
<br />
Perplexity as previously mentioned is a score that comes from information theory [3]. It is a measure of how well a probabilistic model or distribution predicts a sample. This intuitively leads to it being useful for comparing how competition models explain the same sample or dataset. Perplexity has close ties to information entropy as can be seen in the following discrete formulation of perplexity for a probability distribution.<br />
<br />
:<math>PP(p) := 2^{H(p)}=2^{-\sum_x p(x)\log_2 p(x)}</math><br />
<br />
Here <math>H(p)</math> is the entropy in bits and <math>p(x)</math> is the probability of observing <math>x</math> from the distribution.<br />
<br />
Perplexity in the context of probability models also has close ties to information entropy. The idea here is a model <math>f(x)</math> is fit to data from an unknown probability distribution <math>p(x)</math>. When the model is given test samples which were not used during its construction; the model will assign these samples some probability <math>f(x_i)</math>. Here <math>x_i</math> comes from a test set where <math>i = 1,...,N</math>. The perplexity will be lowest for a model which has high probabilities for the test samples. This can be seen in the following equation:<br />
<br />
:<math>PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)}</math><br />
<br />
Here <math>b</math> is the base and can be any number though commonly 2 is used to represent bits.<br />
<br />
==Distributional Statistical Evaluation==<br />
====Zipf Distribution Analysis====<br />
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.<br />
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 4: Zipf Distribution Analysis</div><br />
<br />
<br />
====Self BLEU====<br />
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling. <br />
<br />
[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 5: Comparison of Self-BLEU for decoding strategies</div><br />
<br />
==Conclusion==<br />
In this paper, different decoding strategies were analyzed on open-ended generation tasks. Their results show that (1) likelihood maximization decoding causes degeneration, (2) the probability distributions of the best current language models have an unreliable tail which needs to be truncated during generation, and (3) Nucleus Sampling is currently the best available decoding strategy for generating long-form text that is both high-quality — as measured by human evaluation — and as diverse as human-written text.<br />
<br />
== Critiques==<br />
The only problem that I can observe from the nucleus sampling that it is only restricted to words from p-th percentile of the distribution during generation. The probabilities of words for which the cumulative sum exceeds the percentile are rescaled and the sequence is sampled from this subset. I think a probing methodology will make it generate more varied and semantically richer responses.<br />
<br />
The comparison of perplexity scores across varying decoding strategies is interesting in the sense that we see that the Top-k method with k = 10^3 results in a score roughly similar to the Nucleus method with p = 0.95. Which method would be more computationally efficient? It seems that finding the set <math> V^{(p)}</math> is more computationally intensive compared to choosing the top-k outputs each time.<br />
<br />
== Repository ==<br />
<br />
The official repository for this paper is available at <span class="plainlinks">[https://github.com/ari-holtzman/degen "official repository"]</span><br />
<br />
== Tutorial on beam search ==<br />
1- [https://www.youtube.com/embed/Er2ucMxjdHE Greedy Search]<br />
<br />
2- [https://www.youtube.com/embed/RLWuzLLSIgw?list=PLCSzVeDv57Z1y0uWZXYX2kq5UUpqA0Mk2 Beam search]<br />
<br />
== References ==<br />
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.<br />
<br />
[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018<br />
<br />
[3]: Perplexity: https://en.wikipedia.org/wiki/Perplexity</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:greedySearch.gif&diff=48117File:greedySearch.gif2020-11-30T01:56:26Z<p>Msikarou: </p>
<hr />
<div></div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration&diff=48094The Curious Case of Degeneration2020-11-30T01:37:01Z<p>Msikarou: /* Conclusion */</p>
<hr />
<div>== Presented by == <br />
Donya Hamzeian<br />
== Introduction == <br />
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results. <br />
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 1: Text generation examples</div><br />
<br />
As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements. <br />
<br />
The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts<br />
<br />
Others have questioned whether a problem with beam search is that by expanding on only the top k tokens in each step of the generation, in later steps it may miss possible sequence that would have resulted in a more probable overall phrase. The authors argue that this isn't an issue for generating natural language as it has lower per-token probability on average and people usually optimize against saying the obvious.<br />
<br />
The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have high variance in likelihood leading to incoherency issues. <br />
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.<br />
<br />
<br />
===The problem with a fixed k===<br />
<br />
In the figure below, it can be seen why having a fixed k in the top-k sampling decoding strategy can lead to degenerative results, more specifically, incoherent and low diversity texts. For instance, in the left figure, the distribution of the next token is flat i.e. there are many tokens with nearly equal probability to be the next token. In this case, if we choose a small k, like 5, some tokens like "meant" and "want" may not appear in the generated text which makes it less diverse. On the other hand, in the right figure, the distribution of the next token is peaked, i.e there are very few words with very high probability. In this case, if we choose k to be large, like 10, we may end up choosing tokens like "going" and "n't" which makes the generated text incoherent. Therefore, it seems that having a fixed-k may lead to degeneration<br />
<br />
<br />
[[File: fixed-k.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 2: Flat versus peaked distribution of tokens</div><br />
<br />
==Language Model Decoding==<br />
There are two types of generation tasks. <br />
<br />
1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Due to this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.<br />
<br />
2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks and the focus of this paper.<br />
<br />
The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.<br />
<br />
\begin{align}<br />
P(x_{1:m+n})=\prod_{i=1}^{m+n}P(x_i|x_1 \ldots x_{i-1})<br />
\end{align}<br />
<br />
<br />
<br />
====Nucleus Sampling====<br />
The authors propose Nucleus Sampling as a stochastic decoding method where the shape of the the probability distribution determines the set of vocabulary tokens to be sampled.<br />
In this they first find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. They then normalise the subset <math>V^{(p)}</math> into a probability distribution by dividing its elements by <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. These normalized probabilities will then be used for the generation of word samples. This entire process can be viewed as a re-scaling to the original probability distribution in to a new distbition <math>P'</math>. Where: <br />
<br />
\begin{align}<br />
P'(x|x_{1:i-1}) = \begin{cases}\frac{P(x|x_{1:i-1})}{p'}, & x \in V^{(p)} \\ 0, & otherwise \end{cases}<br />
\end{align}<br />
<br />
This decoding strategy is beneficial as it can truncate possible long tails of the original probability distribution. Thus is can then help avoid the associated problem of incoherent samples for phrases generated by long-tailed distributions as previously discussed.<br />
<br />
====Top-k Sampling====<br />
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size <math>k</math>, <math>V^{(k)} </math>, which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling. <br />
<br />
Intuitively, the difference between Top-k sampling and Nucleus sampling is how they set a threshold of truncation - the former one defines a threshold at which the tail of the probability distribution gets truncated, whereas the latter puts a cap the number of tokens in the vocabulary set. It is noteworthy that thresholding the number of tokens can cause <math>p'</math> to fluctuate greatly at different time steps.<br />
<br />
====Sampling with Temperature====<br />
In this method, which was proposed in [1], the probability of tokens are calculated according to the equation below where <math>t \in (0,1)</math> is the temperature and <math>u_{1:|V|} </math> are logits. <br />
<br />
<math><br />
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}<br />
</math><br />
<br />
Recent studies have shown that lowering <math>t</math> improves the quality of the generated texts while it decreases diversity. Note that the temperature <math>t</math> controls how conservative the model is, and this analogy comes from thermodynamics, where lower temperature means lower energy states are unlikely to be encountered. Hence, the lower the temperature, the less likely the model is to sample tokens with lower probability.<br />
<br />
==Likelihood Evaluation==<br />
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.<br />
<br />
<br />
====Perplexity====<br />
<br />
This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts. <br />
<br />
[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 3: Comparison of perplexity across decoding strategies</div><br />
<br />
====What is Perplexity?====<br />
<br />
Perplexity as previously mentioned is a score that comes from information theory [3]. It is a measure of how well a probabilistic model or distribution predicts a sample. This intuitively leads to it being useful for comparing how competition models explain the same sample or dataset. Perplexity has close ties to information entropy as can be seen in the following discrete formulation of perplexity for a probability distribution.<br />
<br />
:<math>PP(p) := 2^{H(p)}=2^{-\sum_x p(x)\log_2 p(x)}</math><br />
<br />
Here <math>H(p)</math> is the entropy in bits and <math>p(x)</math> is the probability of observing <math>x</math> from the distribution.<br />
<br />
Perplexity in the context of probability models also has close ties to information entropy. The idea here is a model <math>f(x)</math> is fit to data from an unknown probability distribution <math>p(x)</math>. When the model is given test samples which were not used during its construction; the model will assign these samples some probability <math>f(x_i)</math>. Here <math>x_i</math> comes from a test set where <math>i = 1,...,N</math>. The perplexity will be lowest for a model which has high probabilities for the test samples. This can be seen in the following equation:<br />
<br />
:<math>PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)}</math><br />
<br />
Here <math>b</math> is the base and can be any number though commonly 2 is used to represent bits.<br />
<br />
==Distributional Statistical Evaluation==<br />
====Zipf Distribution Analysis====<br />
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.<br />
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 4: Zipf Distribution Analysis</div><br />
<br />
<br />
====Self BLEU====<br />
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling. <br />
<br />
[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 5: Comparison of Self-BLEU for decoding strategies</div><br />
<br />
==Conclusion==<br />
In this paper, different decoding strategies were analyzed on open-ended generation tasks. Their results show that (1) likelihood maximization decoding causes degeneration, (2) the probability distributions of the best current language models have an unreliable tail which needs to be truncated during generation, and (3) Nucleus Sampling is currently the best available decoding strategy for generating long-form text that is both high-quality — as measured by human evaluation — and as diverse as human-written text.<br />
<br />
== Critiques==<br />
The only problem that I can observe from the nucleus sampling that it is only restricted to words from p-th percentile of the distribution during generation. The probabilities of words for which the cumulative sum exceeds the percentile are rescaled and the sequence is sampled from this subset. I think a probing methodology will make it generate more varied and semantically richer responses.<br />
<br />
The comparison of perplexity scores across varying decoding strategies is interesting in the sense that we see that the Top-k method with k = 10^3 results in a score roughly similar to the Nucleus method with p = 0.95. Which method would be more computationally efficient? It seems that finding the set <math> V^{(p)}</math> is more computationally intensive compared to choosing the top-k outputs each time.<br />
<br />
== Repository ==<br />
<br />
The official repository for this paper is available at <span class="plainlinks">[https://github.com/ari-holtzman/degen "official repository"]</span><br />
<br />
== References ==<br />
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.<br />
<br />
[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018<br />
<br />
[3]: Perplexity: https://en.wikipedia.org/wiki/Perplexity</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration&diff=48062The Curious Case of Degeneration2020-11-30T01:12:41Z<p>Msikarou: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Donya Hamzeian<br />
== Introduction == <br />
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results. <br />
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 1: Text generation examples</div><br />
<br />
As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements. <br />
<br />
The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts<br />
<br />
Others have questioned whether a problem with beam search is that by expanding on only the top k tokens in each step of the generation, in later steps it may miss possible sequence that would have resulted in a more probable overall phrase. The authors argue that this isn't an issue for generating natural language as it has lower per-token probability on average and people usually optimize against saying the obvious.<br />
<br />
The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have high variance in likelihood leading to incoherency issues. <br />
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.<br />
<br />
<br />
===The problem with a fixed k===<br />
<br />
In the figure below, it can be seen why having a fixed k in the top-k sampling decoding strategy can lead to degenerative results, more specifically, incoherent and low diversity texts. For instance, in the left figure, the distribution of the next token is flat i.e. there are many tokens with nearly equal probability to be the next token. In this case, if we choose a small k, like 5, some tokens like "meant" and "want" may not appear in the generated text which makes it less diverse. On the other hand, in the right figure, the distribution of the next token is peaked, i.e there are very few words with very high probability. In this case, if we choose k to be large, like 10, we may end up choosing tokens like "going" and "n't" which makes the generated text incoherent. Therefore, it seems that having a fixed-k may lead to degeneration<br />
<br />
<br />
[[File: fixed-k.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 2: Flat versus peaked distribution of tokens</div><br />
<br />
==Language Model Decoding==<br />
There are two types of generation tasks. <br />
<br />
1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Due to this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.<br />
<br />
2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks and the focus of this paper.<br />
<br />
The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability.<br />
<br />
\begin{align}<br />
P(x_{1:m+n})=\prod_{i=1}^{m+n}P(x_i|x_1 \ldots x_{i-1})<br />
\end{align}<br />
<br />
<br />
<br />
====Nucleus Sampling====<br />
The authors propose Nucleus Sampling as a stochastic decoding method where the shape of the the probability distribution determines the set of vocabulary tokens to be sampled.<br />
In this they first find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. They then normalise the subset <math>V^{(p)}</math> into a probability distribution by dividing its elements by <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. These normalized probabilities will then be used for the generation of word samples. This entire process can be viewed as a re-scaling to the original probability distribution in to a new distbition <math>P'</math>. Where: <br />
<br />
\begin{align}<br />
P'(x|x_{1:i-1}) = \begin{cases}\frac{P(x|x_{1:i-1})}{p'}, & x \in V^{(p)} \\ 0, & otherwise \end{cases}<br />
\end{align}<br />
<br />
This decoding strategy is beneficial as it can truncate possible long tails of the original probability distribution. Thus is can then help avoid the associated problem of incoherent samples for phrases generated by long-tailed distributions as previously discussed.<br />
<br />
====Top-k Sampling====<br />
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size <math>k</math>, <math>V^{(k)} </math>, which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling. <br />
<br />
Intuitively, the difference between Top-k sampling and Nucleus sampling is how they set a threshold of truncation - the former one defines a threshold at which the tail of the probability distribution gets truncated, whereas the latter puts a cap the number of tokens in the vocabulary set. It is noteworthy that thresholding the number of tokens can cause <math>p'</math> to fluctuate greatly at different time steps.<br />
<br />
====Sampling with Temperature====<br />
In this method, which was proposed in [1], the probability of tokens are calculated according to the equation below where <math>t \in (0,1)</math> is the temperature and <math>u_{1:|V|} </math> are logits. <br />
<br />
<math><br />
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}<br />
</math><br />
<br />
Recent studies have shown that lowering <math>t</math> improves the quality of the generated texts while it decreases diversity. Note that the temperature <math>t</math> controls how conservative the model is, and this analogy comes from thermodynamics, where lower temperature means lower energy states are unlikely to be encountered. Hence, the lower the temperature, the less likely the model is to sample tokens with lower probability.<br />
<br />
==Likelihood Evaluation==<br />
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.<br />
<br />
<br />
====Perplexity====<br />
<br />
This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts. <br />
<br />
[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 3: Comparison of perplexity across decoding strategies</div><br />
<br />
====What is Perplexity?====<br />
<br />
Perplexity as previously mentioned is a score that comes from information theory [3]. It is a measure of how well a probabilistic model or distribution predicts a sample. This intuitively leads to it being useful for comparing how competition models explain the same sample or dataset. Perplexity has close ties to information entropy as can be seen in the following discrete formulation of perplexity for a probability distribution.<br />
<br />
:<math>PP(p) := 2^{H(p)}=2^{-\sum_x p(x)\log_2 p(x)}</math><br />
<br />
Here <math>H(p)</math> is the entropy in bits and <math>p(x)</math> is the probability of observing <math>x</math> from the distribution.<br />
<br />
Perplexity in the context of probability models also has close ties to information entropy. The idea here is a model <math>f(x)</math> is fit to data from an unknown probability distribution <math>p(x)</math>. When the model is given test samples which were not used during its construction; the model will assign these samples some probability <math>f(x_i)</math>. Here <math>x_i</math> comes from a test set where <math>i = 1,...,N</math>. The perplexity will be lowest for a model which has high probabilities for the test samples. This can be seen in the following equation:<br />
<br />
:<math>PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)}</math><br />
<br />
Here <math>b</math> is the base and can be any number though commonly 2 is used to represent bits.<br />
<br />
==Distributional Statistical Evaluation==<br />
====Zipf Distribution Analysis====<br />
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.<br />
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 4: Zipf Distribution Analysis</div><br />
<br />
<br />
====Self BLEU====<br />
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling. <br />
<br />
[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<div align="center">Figure 5: Comparison of Self-BLEU for decoding strategies</div><br />
<br />
==Conclusion==<br />
In this paper, different decoding strategies were analyzed on open-ended generation tasks. They showed that likelihood maximization decoding causes degeneration where decoding strategies which rely on truncating the probability distribution of tokens especially Nucleus sampling can produce coherent and diverse texts close to human-generated texts and was proposed as a solution that captures the region of confidence of language models effectively.<br />
<br />
== Critiques==<br />
The only problem that I can observe from the nucleus sampling that it is only restricted to words from p-th percentile of the distribution during generation. The probabilities of words for which the cumulative sum exceeds the percentile are rescaled and the sequence is sampled from this subset. I think a probing methodology will make it generate more varied and semantically richer responses.<br />
<br />
The comparison of perplexity scores across varying decoding strategies is interesting in the sense that we see that the Top-k method with k = 10^3 results in a score roughly similar to the Nucleus method with p = 0.95. Which method would be more computationally efficient? It seems that finding the set <math> V^{(p)}</math> is more computationally intensive compared to choosing the top-k outputs each time.<br />
<br />
== Repository ==<br />
<br />
The official repository for this paper is available at <span class="plainlinks">[https://github.com/ari-holtzman/degen "official repository"]</span><br />
<br />
== References ==<br />
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.<br />
<br />
[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018<br />
<br />
[3]: Perplexity: https://en.wikipedia.org/wiki/Perplexity</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding&diff=48038STAT946F20/BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding2020-11-30T01:05:13Z<p>Msikarou: /* Comparison between ELMo, GPT, and BERT */</p>
<hr />
<div>== Presented by == <br />
Wenyu Shen<br />
<br />
== Introduction == <br />
This paper introduces the structure of the BERT model. The full name of the BERT model is Bidirectional Encoder Representations from Transformers, and this language model breaks records in eleven natural language process tasks. BERT advanced the state-of-the-art for pre-training of contextual representations. One novel feature as compared to Word2Vec or GLoVE, is the ability for BERT to produce different representations for a unique word given different contexts. To elaborate, Word2Vec would always create the same embedding for a given word regardless of the words that precede and proceed with it. BERT however, will generate different embeddings based on what precedes and proceeds it. This can be useful as words can have homonyms, such as "bank" where it could refer to a "bank" as a "financial institution" or the "land alongside or sloping down to a river or lake".<br />
<br />
== Transformer and BERT == <br />
Let us start with the introduction of encoder and decoder. From the class, the encoder-decoder model is applied in the seq2seq question. For the sea2seq question, if we input a sequence x, then through performing the encoder-decoder model, we could generate another output sequence y based on x (like translation, questions with answer system). However, while using the RNN or other models as the basic architecture of encoder-decoder, the model might not have great performance while the input source is too long. Though we can use the encoder-decoder with attention which does not merge all the output into one context(layer), the paper Attention is All You Need [1] introduce a framework and only use Attention in the encoder-decoder to do the machine translation task. The Transformer utilized the Scaled Dot-Product Attention and the sequential mask in the decoder and usually performs Multi-head attention to derive more features from the different subspace of sentence for the individual token. The transformer trained the positional encoding, which has the same dimension as the word embedding, to obtain the sequential information of the inputs. BERT is built by the N unit of the transformer encoder. <br />
<br />
[[File:Transformer Structure.png | center |800px]]<br />
<br />
<div align="center">Table 1: Transformer Structure </div><br />
<br />
== BERT ==<br />
BERT works well in both the Feature-based and the Fine-tuning approaches. Both Feature-based and Fine-tuning structures started with unsupervised learning from source A. While the Feature-based approach keeps the pre-trained parameters fixed while using the labeled source B to train the task-specific model and get the additional feature, the Fine-tuning approach tunes all parameters when training on the afterword task. This paper improves BERT based on the Fine-tuning approach. Original transformer learned from left to right. The deep bidirectional model is strictly more powerful than the left-to-right, or even the concatenation of the left-to-right and right-to-left models. However, bidirectional conditioning would allow each word to see itself indirectly, which makes the problem trivial. Therefore, BERT used the MLM (masked language model) to pre-train deep bidirectional Transformers. In this pretraining method, some random tokens are masked each time and the model's objective is to find the vocabulary id of the masked token based on both its left and its right contexts. Also, BERT performs the Next Sentence Prediction(NSP) task to make the model understand the relationship between sentences. In the NSP task, two sentences, A and B are fed to the network to predict whether they are consecutive or not. These pairs of sentences in the train data are 50% of the time consecutive (labeled as IsNext) and 50% of the time random sentences from the corpus( labeled as NotNext). Also, the Input/Output Representation created Token Embeddings, Segment Embeddings, and Position Embeddings to make BERT accomplish a variety of downstream tasks. Additionally, during this paper, the randomly selected tokens in MLM are not always utilized by mask to solve the unmatched issue while pre-training and fine-tuning models. To resolve this mismatch, the 15% of the tokens selected to be predicted are 80% of the time replaced with [MASK], 10% of the time are replaced with a random token, and 10% of the time remain unchanged. <br />
[[File:Token embedding.png | center | 800px]]<br />
<br />
<div align="center">Table 2: Token embedding</div><br />
<br />
== Applications ==<br />
<br />
As previously mentioned BERT has achieved state-of-the-art performance in eleven NLP tasks. BERT can even be trained on different corpora/data as seen in figure 1 and then different pre-training and fine-tuning can be applied downstream, this landscape is surely not exhaustive. This aids in showing the wide range of applications BERT can be completely retrained for.<br />
<br />
[[File:application_landscape.png| center |1000px|Image: 1000 pixels]]<br />
<br />
<div align="center">Figure 1: Landscape of BERT Applications</div><br />
<br />
== Comparison between ELMo, GPT, and BERT ==<br />
In this section, we are going to compare BERT with previous language models, in particular, ELMo and GPT. These three models are among the biggest advancements in NLP. ELMo is a bi-directional LSTM model and is able to capture context information from both directions. It's a feature-based approach, which means the pre-trained representations are used as features. GPT and BERT are both transformer-based models. GPT only uses transformer decoders and is unidirectional. This means information only flows from the left to the right in GPT. In contrast, BERT only uses transformer encoders and is bidirectional. Therefore, it can capture more context information than GPT and tends to perform better when context information from both sides is important. GPT and BERT are fine-tuning-based approaches. Users can use the models on downstream tasks by simply fine-tuning model parameters.<br />
<br />
[[File:comparison_paper5.png | center |800px]]<br />
By looking at the above picture we can have a better understanding of the comparison between these three models. As mentioned above GPT is unidirectional which means the layers are not dense and only weights from left to right are present. BERT is bidirectional in the sense that both weight from left to right and from right to left are present (the layers are dense). ELMo is also bidirectional but not the same way as BERT. It actually uses a concatenation of independently trained left-to-right and right-to-left LSTMs. Note that among these three models, only BERT representations are jointly conditioned on both directions' context in all layers.<br />
<br />
== Conclusion ==<br />
<br />
Consequently, BERT is a powerful pre-trained model in a large number of unsupervised resources and contributes when we want to perform NLP tasks with a low amount of obtained data.<br />
<br />
<br />
[[File:Result.png | center |800px]]<br />
<br />
<div align="center">Table 3: Performance of BERT in multiple datasets</div><br />
<br />
== Repository ==<br />
<br />
A github repository for BERT is available at <span class="plainlinks">[https://github.com/brightmart/bert_language_understanding "official repository"]</span><br />
<br />
== Fun facts ==<br />
<br />
A collection of BERT-related papers published in 2019. The y-axis is the log of the citation count (based on Google Scholar).<br />
[[File:BERT-related.gif|800px|center]]<br />
<br />
== References ==<br />
[1] Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin.<br />
"Attention Is All You Need". (2017)<br />
<br />
[2] <br />
Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language".(2019)</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding&diff=48036STAT946F20/BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding2020-11-30T01:04:27Z<p>Msikarou: /* BERT */</p>
<hr />
<div>== Presented by == <br />
Wenyu Shen<br />
<br />
== Introduction == <br />
This paper introduces the structure of the BERT model. The full name of the BERT model is Bidirectional Encoder Representations from Transformers, and this language model breaks records in eleven natural language process tasks. BERT advanced the state-of-the-art for pre-training of contextual representations. One novel feature as compared to Word2Vec or GLoVE, is the ability for BERT to produce different representations for a unique word given different contexts. To elaborate, Word2Vec would always create the same embedding for a given word regardless of the words that precede and proceed with it. BERT however, will generate different embeddings based on what precedes and proceeds it. This can be useful as words can have homonyms, such as "bank" where it could refer to a "bank" as a "financial institution" or the "land alongside or sloping down to a river or lake".<br />
<br />
== Transformer and BERT == <br />
Let us start with the introduction of encoder and decoder. From the class, the encoder-decoder model is applied in the seq2seq question. For the sea2seq question, if we input a sequence x, then through performing the encoder-decoder model, we could generate another output sequence y based on x (like translation, questions with answer system). However, while using the RNN or other models as the basic architecture of encoder-decoder, the model might not have great performance while the input source is too long. Though we can use the encoder-decoder with attention which does not merge all the output into one context(layer), the paper Attention is All You Need [1] introduce a framework and only use Attention in the encoder-decoder to do the machine translation task. The Transformer utilized the Scaled Dot-Product Attention and the sequential mask in the decoder and usually performs Multi-head attention to derive more features from the different subspace of sentence for the individual token. The transformer trained the positional encoding, which has the same dimension as the word embedding, to obtain the sequential information of the inputs. BERT is built by the N unit of the transformer encoder. <br />
<br />
[[File:Transformer Structure.png | center |800px]]<br />
<br />
<div align="center">Table 1: Transformer Structure </div><br />
<br />
== BERT ==<br />
BERT works well in both the Feature-based and the Fine-tuning approaches. Both Feature-based and Fine-tuning structures started with unsupervised learning from source A. While the Feature-based approach keeps the pre-trained parameters fixed while using the labeled source B to train the task-specific model and get the additional feature, the Fine-tuning approach tunes all parameters when training on the afterword task. This paper improves BERT based on the Fine-tuning approach. Original transformer learned from left to right. The deep bidirectional model is strictly more powerful than the left-to-right, or even the concatenation of the left-to-right and right-to-left models. However, bidirectional conditioning would allow each word to see itself indirectly, which makes the problem trivial. Therefore, BERT used the MLM (masked language model) to pre-train deep bidirectional Transformers. In this pretraining method, some random tokens are masked each time and the model's objective is to find the vocabulary id of the masked token based on both its left and its right contexts. Also, BERT performs the Next Sentence Prediction(NSP) task to make the model understand the relationship between sentences. In the NSP task, two sentences, A and B are fed to the network to predict whether they are consecutive or not. These pairs of sentences in the train data are 50% of the time consecutive (labeled as IsNext) and 50% of the time random sentences from the corpus( labeled as NotNext). Also, the Input/Output Representation created Token Embeddings, Segment Embeddings, and Position Embeddings to make BERT accomplish a variety of downstream tasks. Additionally, during this paper, the randomly selected tokens in MLM are not always utilized by mask to solve the unmatched issue while pre-training and fine-tuning models. To resolve this mismatch, the 15% of the tokens selected to be predicted are 80% of the time replaced with [MASK], 10% of the time are replaced with a random token, and 10% of the time remain unchanged. <br />
[[File:Token embedding.png | center | 800px]]<br />
<br />
<div align="center">Table 2: Token embedding</div><br />
<br />
== Applications ==<br />
<br />
As previously mentioned BERT has achieved state-of-the-art performance in eleven NLP tasks. BERT can even be trained on different corpora/data as seen in figure 1 and then different pre-training and fine-tuning can be applied downstream, this landscape is surely not exhaustive. This aids in showing the wide range of applications BERT can be completely retrained for.<br />
<br />
[[File:application_landscape.png| center |1000px|Image: 1000 pixels]]<br />
<br />
<div align="center">Figure 1: Landscape of BERT Applications</div><br />
<br />
== Comparison between ELMo, GPT, and BERT ==<br />
In this section, we are going to compare BERT with previous language models, in particular, ELMo and GPT. These three models are among the biggest advancements in NLP. ELMo is a bi-directional LSTM model and is able to capture context information from both directions. It's a feature-based approach, which means the pre-trained representations are used as features. GPT and BERT are both transformer-based models. GPT only uses transformer decoders and is unidirectional. This means information only flows from the left to the right in GPT. In contrast, BERT only uses transformer encoders and is bidirectional. Therefore, it's able to capture more context information than GPT and tend to perform better when context information from both sides are important. GPT and BERT are fine-tuning-based approaches. Users can use the models on downstream tasks by simply fine-tuning model parameters.<br />
<br />
[[File:comparison_paper5.png | center |800px]]<br />
By looking at the above picture we can have a better understanding of the comparison between these three models. As mentioned above GPT is unidirectional which means the layers are not dense and only weights from left to right are present. BERT is bidirectional in the sense that both weight from left to right and from right to left are present (the layers are dense). ELMo is also bidirectional but not the same way as BERT. It actually uses a concatenation of independently trained left-to-right and right-to-left LSTMs. Note that among these three models, only BERT representations are jointly conditioned on both directions' context in all layers.<br />
<br />
== Conclusion ==<br />
<br />
Consequently, BERT is a powerful pre-trained model in a large number of unsupervised resources and contributes when we want to perform NLP tasks with a low amount of obtained data.<br />
<br />
<br />
[[File:Result.png | center |800px]]<br />
<br />
<div align="center">Table 3: Performance of BERT in multiple datasets</div><br />
<br />
== Repository ==<br />
<br />
A github repository for BERT is available at <span class="plainlinks">[https://github.com/brightmart/bert_language_understanding "official repository"]</span><br />
<br />
== Fun facts ==<br />
<br />
A collection of BERT-related papers published in 2019. The y-axis is the log of the citation count (based on Google Scholar).<br />
[[File:BERT-related.gif|800px|center]]<br />
<br />
== References ==<br />
[1] Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin.<br />
"Attention Is All You Need". (2017)<br />
<br />
[2] <br />
Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language".(2019)</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding&diff=48034STAT946F20/BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding2020-11-30T01:04:08Z<p>Msikarou: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Wenyu Shen<br />
<br />
== Introduction == <br />
This paper introduces the structure of the BERT model. The full name of the BERT model is Bidirectional Encoder Representations from Transformers, and this language model breaks records in eleven natural language process tasks. BERT advanced the state-of-the-art for pre-training of contextual representations. One novel feature as compared to Word2Vec or GLoVE, is the ability for BERT to produce different representations for a unique word given different contexts. To elaborate, Word2Vec would always create the same embedding for a given word regardless of the words that precede and proceed with it. BERT however, will generate different embeddings based on what precedes and proceeds it. This can be useful as words can have homonyms, such as "bank" where it could refer to a "bank" as a "financial institution" or the "land alongside or sloping down to a river or lake".<br />
<br />
== Transformer and BERT == <br />
Let us start with the introduction of encoder and decoder. From the class, the encoder-decoder model is applied in the seq2seq question. For the sea2seq question, if we input a sequence x, then through performing the encoder-decoder model, we could generate another output sequence y based on x (like translation, questions with answer system). However, while using the RNN or other models as the basic architecture of encoder-decoder, the model might not have great performance while the input source is too long. Though we can use the encoder-decoder with attention which does not merge all the output into one context(layer), the paper Attention is All You Need [1] introduce a framework and only use Attention in the encoder-decoder to do the machine translation task. The Transformer utilized the Scaled Dot-Product Attention and the sequential mask in the decoder and usually performs Multi-head attention to derive more features from the different subspace of sentence for the individual token. The transformer trained the positional encoding, which has the same dimension as the word embedding, to obtain the sequential information of the inputs. BERT is built by the N unit of the transformer encoder. <br />
<br />
[[File:Transformer Structure.png | center |800px]]<br />
<br />
<div align="center">Table 1: Transformer Structure </div><br />
<br />
== BERT ==<br />
BERT works well in both the Feature-based and the Fine-tuning approaches. Both Feature-based and Fine-tuning structures started with unsupervised learning from source A. While the Feature-based approach keeps the pre-trained parameters fixed while using the labeled source B to train the task-specific model and get the additional feature, the Fine-tuning approach tunes all parameters when training on the afterword task. This paper improves BERT based on the Fine-tuning approach. Original transformer learned from left to right. Deep bidirectional model is strictly more powerful than the left-to-right, or even the concatenation of the left-to-right and right-to-left models. However, bidirectional conditioning would allow each word to see itself indirectly, which makes the problem trivial. Therefore, BERT used the MLM (masked language model) to pre-train deep bidirectional Transformers. In this pretraining method, some random tokens are masked each time and the model's objective is to find the vocabulary id of the masked token based on both its left and its right contexts. Also, BERT performs the Next Sentence Prediction(NSP) task to make the model understand the relationship between sentences. In the NSP task, two sentences, A and B are fed to the network to predict whether they are consecutive or not. These pair of sentences in the train data are 50% of the time consecutive (labeled as IsNext) and 50% of the time random sentences from the corpus( labeled as NotNext). Also, the Input/Output Representation created Token Embeddings, Segment Embeddings, and Position Embeddings to make BERT accomplish a variety of downstream tasks. Additionally, during this paper, the randomly selected tokens in MLM are not always utilized by mask to solve the unmatched issue while pre-training and fine-tuning models. To resolve this mismatch, the 15% of the tokens selected to be predicted are 80% of the time replaced with [MASK], 10% of the time are replaced with a random token, and 10% of the time remain unchanged. <br />
[[File:Token embedding.png | center | 800px]]<br />
<br />
<div align="center">Table 2: Token embedding</div><br />
<br />
== Applications ==<br />
<br />
As previously mentioned BERT has achieved state-of-the-art performance in eleven NLP tasks. BERT can even be trained on different corpora/data as seen in figure 1 and then different pre-training and fine-tuning can be applied downstream, this landscape is surely not exhaustive. This aids in showing the wide range of applications BERT can be completely retrained for.<br />
<br />
[[File:application_landscape.png| center |1000px|Image: 1000 pixels]]<br />
<br />
<div align="center">Figure 1: Landscape of BERT Applications</div><br />
<br />
== Comparison between ELMo, GPT, and BERT ==<br />
In this section, we are going to compare BERT with previous language models, in particular, ELMo and GPT. These three models are among the biggest advancements in NLP. ELMo is a bi-directional LSTM model and is able to capture context information from both directions. It's a feature-based approach, which means the pre-trained representations are used as features. GPT and BERT are both transformer-based models. GPT only uses transformer decoders and is unidirectional. This means information only flows from the left to the right in GPT. In contrast, BERT only uses transformer encoders and is bidirectional. Therefore, it's able to capture more context information than GPT and tend to perform better when context information from both sides are important. GPT and BERT are fine-tuning-based approaches. Users can use the models on downstream tasks by simply fine-tuning model parameters.<br />
<br />
[[File:comparison_paper5.png | center |800px]]<br />
By looking at the above picture we can have a better understanding of the comparison between these three models. As mentioned above GPT is unidirectional which means the layers are not dense and only weights from left to right are present. BERT is bidirectional in the sense that both weight from left to right and from right to left are present (the layers are dense). ELMo is also bidirectional but not the same way as BERT. It actually uses a concatenation of independently trained left-to-right and right-to-left LSTMs. Note that among these three models, only BERT representations are jointly conditioned on both directions' context in all layers.<br />
<br />
== Conclusion ==<br />
<br />
Consequently, BERT is a powerful pre-trained model in a large number of unsupervised resources and contributes when we want to perform NLP tasks with a low amount of obtained data.<br />
<br />
<br />
[[File:Result.png | center |800px]]<br />
<br />
<div align="center">Table 3: Performance of BERT in multiple datasets</div><br />
<br />
== Repository ==<br />
<br />
A github repository for BERT is available at <span class="plainlinks">[https://github.com/brightmart/bert_language_understanding "official repository"]</span><br />
<br />
== Fun facts ==<br />
<br />
A collection of BERT-related papers published in 2019. The y-axis is the log of the citation count (based on Google Scholar).<br />
[[File:BERT-related.gif|800px|center]]<br />
<br />
== References ==<br />
[1] Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin.<br />
"Attention Is All You Need". (2017)<br />
<br />
[2] <br />
Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language".(2019)</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding&diff=48029STAT946F20/BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding2020-11-30T01:01:38Z<p>Msikarou: /* Repository */</p>
<hr />
<div>== Presented by == <br />
Wenyu Shen<br />
<br />
== Introduction == <br />
This paper introduces the structure of the BERT model. The full name of the BERT model is Bidirectional Encoder Representations from Transformers, and this language model breaks records in eleven natural language process tasks. BERT advanced the state-of-the-art for pre-training of contextual representations. One novel feature as compared to Word2Vec or GLoVE, is the ability for BERT to produce different representations for a unique word given different contexts. To elaborate, Word2Vec would always create the same embedding for a given word regardless of the words that precede and proceed it. BERT however, will generate different embeddings based on what precedes and proceeds it. This can be useful as words can have homonyms, such as "bank" where it could refer to a "bank" as a "financial institution" or the "land alongside or sloping down to a river or lake".<br />
<br />
== Transformer and BERT == <br />
Let us start with the introduction of encoder and decoder. From the class, the encoder-decoder model is applied in the seq2seq question. For the sea2seq question, if we input a sequence x, then through performing the encoder-decoder model, we could generate another output sequence y based on x (like translation, questions with answer system). However, while using the RNN or other models as the basic architecture of encoder-decoder, the model might not have great performance while the input source is too long. Though we can use the encoder-decoder with attention which does not merge all the output into one context(layer), the paper Attention is All You Need [1] introduce a framework and only use Attention in the encoder-decoder to do the machine translation task. The Transformer utilized the Scaled Dot-Product Attention and the sequential mask in the decoder and usually performs Multi-head attention to derive more features from the different subspace of sentence for the individual token. The transformer trained the positional encoding, which has the same dimension as the word embedding, to obtain the sequential information of the inputs. BERT is built by the N unit of the transformer encoder. <br />
<br />
[[File:Transformer Structure.png | center |800px]]<br />
<br />
<div align="center">Table 1: Transformer Structure </div><br />
<br />
== BERT ==<br />
BERT works well in both the Feature-based and the Fine-tuning approaches. Both Feature-based and Fine-tuning structures started with unsupervised learning from source A. While the Feature-based approach keeps the pre-trained parameters fixed while using the labeled source B to train the task-specific model and get the additional feature, the Fine-tuning approach tunes all parameters when training on the afterword task. This paper improves BERT based on the Fine-tuning approach. Original transformer learned from left to right. Deep bidirectional model is strictly more powerful than the left-to-right, or even the concatenation of the left-to-right and right-to-left models. However, bidirectional conditioning would allow each word to see itself indirectly, which makes the problem trivial. Therefore, BERT used the MLM (masked language model) to pre-train deep bidirectional Transformers. In this pretraining method, some random tokens are masked each time and the model's objective is to find the vocabulary id of the masked token based on both its left and its right contexts. Also, BERT performs the Next Sentence Prediction(NSP) task to make the model understand the relationship between sentences. In the NSP task, two sentences, A and B are fed to the network to predict whether they are consecutive or not. These pair of sentences in the train data are 50% of the time consecutive (labeled as IsNext) and 50% of the time random sentences from the corpus( labeled as NotNext). Also, the Input/Output Representation created Token Embeddings, Segment Embeddings, and Position Embeddings to make BERT accomplish a variety of downstream tasks. Additionally, during this paper, the randomly selected tokens in MLM are not always utilized by mask to solve the unmatched issue while pre-training and fine-tuning models. To resolve this mismatch, the 15% of the tokens selected to be predicted are 80% of the time replaced with [MASK], 10% of the time are replaced with a random token, and 10% of the time remain unchanged. <br />
[[File:Token embedding.png | center | 800px]]<br />
<br />
<div align="center">Table 2: Token embedding</div><br />
<br />
== Applications ==<br />
<br />
As previously mentioned BERT has achieved state-of-the-art performance in eleven NLP tasks. BERT can even be trained on different corpora/data as seen in figure 1 and then different pre-training and fine-tuning can be applied downstream, this landscape is surely not exhaustive. This aids in showing the wide range of applications BERT can be completely retrained for.<br />
<br />
[[File:application_landscape.png| center |1000px|Image: 1000 pixels]]<br />
<br />
<div align="center">Figure 1: Landscape of BERT Applications</div><br />
<br />
== Comparison between ELMo, GPT, and BERT ==<br />
In this section, we are going to compare BERT with previous language models, in particular, ELMo and GPT. These three models are among the biggest advancements in NLP. ELMo is a bi-directional LSTM model and is able to capture context information from both directions. It's a feature-based approach, which means the pre-trained representations are used as features. GPT and BERT are both transformer-based models. GPT only uses transformer decoders and is unidirectional. This means information only flows from the left to the right in GPT. In contrast, BERT only uses transformer encoders and is bidirectional. Therefore, it's able to capture more context information than GPT and tend to perform better when context information from both sides are important. GPT and BERT are fine-tuning-based approaches. Users can use the models on downstream tasks by simply fine-tuning model parameters.<br />
<br />
[[File:comparison_paper5.png | center |800px]]<br />
By looking at the above picture we can have a better understanding of the comparison between these three models. As mentioned above GPT is unidirectional which means the layers are not dense and only weights from left to right are present. BERT is bidirectional in the sense that both weight from left to right and from right to left are present (the layers are dense). ELMo is also bidirectional but not the same way as BERT. It actually uses a concatenation of independently trained left-to-right and right-to-left LSTMs. Note that among these three models, only BERT representations are jointly conditioned on both directions' context in all layers.<br />
<br />
== Conclusion ==<br />
<br />
Consequently, BERT is a powerful pre-trained model in a large number of unsupervised resources and contributes when we want to perform NLP tasks with a low amount of obtained data.<br />
<br />
<br />
[[File:Result.png | center |800px]]<br />
<br />
<div align="center">Table 3: Performance of BERT in multiple datasets</div><br />
<br />
== Repository ==<br />
<br />
A github repository for BERT is available at <span class="plainlinks">[https://github.com/brightmart/bert_language_understanding "official repository"]</span><br />
<br />
== Fun facts ==<br />
<br />
A collection of BERT-related papers published in 2019. The y-axis is the log of the citation count (based on Google Scholar).<br />
[[File:BERT-related.gif|800px|center]]<br />
<br />
== References ==<br />
[1] Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin.<br />
"Attention Is All You Need". (2017)<br />
<br />
[2] <br />
Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language".(2019)</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:BERT-related.gif&diff=48026File:BERT-related.gif2020-11-30T00:59:57Z<p>Msikarou: </p>
<hr />
<div></div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding&diff=48020STAT946F20/BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding2020-11-30T00:54:32Z<p>Msikarou: /* Repository */</p>
<hr />
<div>== Presented by == <br />
Wenyu Shen<br />
<br />
== Introduction == <br />
This paper introduces the structure of the BERT model. The full name of the BERT model is Bidirectional Encoder Representations from Transformers, and this language model breaks records in eleven natural language process tasks. BERT advanced the state-of-the-art for pre-training of contextual representations. One novel feature as compared to Word2Vec or GLoVE, is the ability for BERT to produce different representations for a unique word given different contexts. To elaborate, Word2Vec would always create the same embedding for a given word regardless of the words that precede and proceed it. BERT however, will generate different embeddings based on what precedes and proceeds it. This can be useful as words can have homonyms, such as "bank" where it could refer to a "bank" as a "financial institution" or the "land alongside or sloping down to a river or lake".<br />
<br />
== Transformer and BERT == <br />
Let us start with the introduction of encoder and decoder. From the class, the encoder-decoder model is applied in the seq2seq question. For the sea2seq question, if we input a sequence x, then through performing the encoder-decoder model, we could generate another output sequence y based on x (like translation, questions with answer system). However, while using the RNN or other models as the basic architecture of encoder-decoder, the model might not have great performance while the input source is too long. Though we can use the encoder-decoder with attention which does not merge all the output into one context(layer), the paper Attention is All You Need [1] introduce a framework and only use Attention in the encoder-decoder to do the machine translation task. The Transformer utilized the Scaled Dot-Product Attention and the sequential mask in the decoder and usually performs Multi-head attention to derive more features from the different subspace of sentence for the individual token. The transformer trained the positional encoding, which has the same dimension as the word embedding, to obtain the sequential information of the inputs. BERT is built by the N unit of the transformer encoder. <br />
<br />
[[File:Transformer Structure.png | center |800px]]<br />
<br />
<div align="center">Table 1: Transformer Structure </div><br />
<br />
== BERT ==<br />
BERT works well in both the Feature-based and the Fine-tuning approaches. Both Feature-based and Fine-tuning structures started with unsupervised learning from source A. While the Feature-based approach keeps the pre-trained parameters fixed while using the labeled source B to train the task-specific model and get the additional feature, the Fine-tuning approach tunes all parameters when training on the afterword task. This paper improves BERT based on the Fine-tuning approach. Original transformer learned from left to right. Deep bidirectional model is strictly more powerful than the left-to-right, or even the concatenation of the left-to-right and right-to-left models. However, bidirectional conditioning would allow each word to see itself indirectly, which makes the problem trivial. Therefore, BERT used the MLM (masked language model) to pre-train deep bidirectional Transformers. In this pretraining method, some random tokens are masked each time and the model's objective is to find the vocabulary id of the masked token based on both its left and its right contexts. Also, BERT performs the Next Sentence Prediction(NSP) task to make the model understand the relationship between sentences. In the NSP task, two sentences, A and B are fed to the network to predict whether they are consecutive or not. These pair of sentences in the train data are 50% of the time consecutive (labeled as IsNext) and 50% of the time random sentences from the corpus( labeled as NotNext). Also, the Input/Output Representation created Token Embeddings, Segment Embeddings, and Position Embeddings to make BERT accomplish a variety of downstream tasks. Additionally, during this paper, the randomly selected tokens in MLM are not always utilized by mask to solve the unmatched issue while pre-training and fine-tuning models. To resolve this mismatch, the 15% of the tokens selected to be predicted are 80% of the time replaced with [MASK], 10% of the time are replaced with a random token, and 10% of the time remain unchanged. <br />
[[File:Token embedding.png | center | 800px]]<br />
<br />
<div align="center">Table 2: Token embedding</div><br />
<br />
== Applications ==<br />
<br />
As previously mentioned BERT has achieved state-of-the-art performance in eleven NLP tasks. BERT can even be trained on different corpora/data as seen in figure 1 and then different pre-training and fine-tuning can be applied downstream, this landscape is surely not exhaustive. This aids in showing the wide range of applications BERT can be completely retrained for.<br />
<br />
[[File:application_landscape.png| center |1000px|Image: 1000 pixels]]<br />
<br />
<div align="center">Figure 1: Landscape of BERT Applications</div><br />
<br />
== Comparison between ELMo, GPT, and BERT ==<br />
In this section, we are going to compare BERT with previous language models, in particular, ELMo and GPT. These three models are among the biggest advancements in NLP. ELMo is a bi-directional LSTM model and is able to capture context information from both directions. It's a feature-based approach, which means the pre-trained representations are used as features. GPT and BERT are both transformer-based models. GPT only uses transformer decoders and is unidirectional. This means information only flows from the left to the right in GPT. In contrast, BERT only uses transformer encoders and is bidirectional. Therefore, it's able to capture more context information than GPT and tend to perform better when context information from both sides are important. GPT and BERT are fine-tuning-based approaches. Users can use the models on downstream tasks by simply fine-tuning model parameters.<br />
<br />
[[File:comparison_paper5.png | center |800px]]<br />
By looking at the above picture we can have a better understanding of the comparison between these three models. As mentioned above GPT is unidirectional which means the layers are not dense and only weights from left to right are present. BERT is bidirectional in the sense that both weight from left to right and from right to left are present (the layers are dense). ELMo is also bidirectional but not the same way as BERT. It actually uses a concatenation of independently trained left-to-right and right-to-left LSTMs. Note that among these three models, only BERT representations are jointly conditioned on both directions' context in all layers.<br />
<br />
== Conclusion ==<br />
<br />
Consequently, BERT is a powerful pre-trained model in a large number of unsupervised resources and contributes when we want to perform NLP tasks with a low amount of obtained data.<br />
<br />
<br />
[[File:Result.png | center |800px]]<br />
<br />
<div align="center">Table 3: Performance of BERT in multiple datasets</div><br />
<br />
== Repository ==<br />
<br />
A github repository for BERT is available at <span class="plainlinks">[https://github.com/brightmart/bert_language_understanding "official repository"]</span><br />
<br />
== References ==<br />
[1] Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin.<br />
"Attention Is All You Need". (2017)<br />
<br />
[2] <br />
Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language".(2019)</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding&diff=48018STAT946F20/BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding2020-11-30T00:53:39Z<p>Msikarou: /* Conclusion */</p>
<hr />
<div>== Presented by == <br />
Wenyu Shen<br />
<br />
== Introduction == <br />
This paper introduces the structure of the BERT model. The full name of the BERT model is Bidirectional Encoder Representations from Transformers, and this language model breaks records in eleven natural language process tasks. BERT advanced the state-of-the-art for pre-training of contextual representations. One novel feature as compared to Word2Vec or GLoVE, is the ability for BERT to produce different representations for a unique word given different contexts. To elaborate, Word2Vec would always create the same embedding for a given word regardless of the words that precede and proceed it. BERT however, will generate different embeddings based on what precedes and proceeds it. This can be useful as words can have homonyms, such as "bank" where it could refer to a "bank" as a "financial institution" or the "land alongside or sloping down to a river or lake".<br />
<br />
== Transformer and BERT == <br />
Let us start with the introduction of encoder and decoder. From the class, the encoder-decoder model is applied in the seq2seq question. For the sea2seq question, if we input a sequence x, then through performing the encoder-decoder model, we could generate another output sequence y based on x (like translation, questions with answer system). However, while using the RNN or other models as the basic architecture of encoder-decoder, the model might not have great performance while the input source is too long. Though we can use the encoder-decoder with attention which does not merge all the output into one context(layer), the paper Attention is All You Need [1] introduce a framework and only use Attention in the encoder-decoder to do the machine translation task. The Transformer utilized the Scaled Dot-Product Attention and the sequential mask in the decoder and usually performs Multi-head attention to derive more features from the different subspace of sentence for the individual token. The transformer trained the positional encoding, which has the same dimension as the word embedding, to obtain the sequential information of the inputs. BERT is built by the N unit of the transformer encoder. <br />
<br />
[[File:Transformer Structure.png | center |800px]]<br />
<br />
<div align="center">Table 1: Transformer Structure </div><br />
<br />
== BERT ==<br />
BERT works well in both the Feature-based and the Fine-tuning approaches. Both Feature-based and Fine-tuning structures started with unsupervised learning from source A. While the Feature-based approach keeps the pre-trained parameters fixed while using the labeled source B to train the task-specific model and get the additional feature, the Fine-tuning approach tunes all parameters when training on the afterword task. This paper improves BERT based on the Fine-tuning approach. Original transformer learned from left to right. Deep bidirectional model is strictly more powerful than the left-to-right, or even the concatenation of the left-to-right and right-to-left models. However, bidirectional conditioning would allow each word to see itself indirectly, which makes the problem trivial. Therefore, BERT used the MLM (masked language model) to pre-train deep bidirectional Transformers. In this pretraining method, some random tokens are masked each time and the model's objective is to find the vocabulary id of the masked token based on both its left and its right contexts. Also, BERT performs the Next Sentence Prediction(NSP) task to make the model understand the relationship between sentences. In the NSP task, two sentences, A and B are fed to the network to predict whether they are consecutive or not. These pair of sentences in the train data are 50% of the time consecutive (labeled as IsNext) and 50% of the time random sentences from the corpus( labeled as NotNext). Also, the Input/Output Representation created Token Embeddings, Segment Embeddings, and Position Embeddings to make BERT accomplish a variety of downstream tasks. Additionally, during this paper, the randomly selected tokens in MLM are not always utilized by mask to solve the unmatched issue while pre-training and fine-tuning models. To resolve this mismatch, the 15% of the tokens selected to be predicted are 80% of the time replaced with [MASK], 10% of the time are replaced with a random token, and 10% of the time remain unchanged. <br />
[[File:Token embedding.png | center | 800px]]<br />
<br />
<div align="center">Table 2: Token embedding</div><br />
<br />
== Applications ==<br />
<br />
As previously mentioned BERT has achieved state-of-the-art performance in eleven NLP tasks. BERT can even be trained on different corpora/data as seen in figure 1 and then different pre-training and fine-tuning can be applied downstream, this landscape is surely not exhaustive. This aids in showing the wide range of applications BERT can be completely retrained for.<br />
<br />
[[File:application_landscape.png| center |1000px|Image: 1000 pixels]]<br />
<br />
<div align="center">Figure 1: Landscape of BERT Applications</div><br />
<br />
== Comparison between ELMo, GPT, and BERT ==<br />
In this section, we are going to compare BERT with previous language models, in particular, ELMo and GPT. These three models are among the biggest advancements in NLP. ELMo is a bi-directional LSTM model and is able to capture context information from both directions. It's a feature-based approach, which means the pre-trained representations are used as features. GPT and BERT are both transformer-based models. GPT only uses transformer decoders and is unidirectional. This means information only flows from the left to the right in GPT. In contrast, BERT only uses transformer encoders and is bidirectional. Therefore, it's able to capture more context information than GPT and tend to perform better when context information from both sides are important. GPT and BERT are fine-tuning-based approaches. Users can use the models on downstream tasks by simply fine-tuning model parameters.<br />
<br />
[[File:comparison_paper5.png | center |800px]]<br />
By looking at the above picture we can have a better understanding of the comparison between these three models. As mentioned above GPT is unidirectional which means the layers are not dense and only weights from left to right are present. BERT is bidirectional in the sense that both weight from left to right and from right to left are present (the layers are dense). ELMo is also bidirectional but not the same way as BERT. It actually uses a concatenation of independently trained left-to-right and right-to-left LSTMs. Note that among these three models, only BERT representations are jointly conditioned on both directions' context in all layers.<br />
<br />
== Conclusion ==<br />
<br />
Consequently, BERT is a powerful pre-trained model in a large number of unsupervised resources and contributes when we want to perform NLP tasks with a low amount of obtained data.<br />
<br />
<br />
[[File:Result.png | center |800px]]<br />
<br />
<div align="center">Table 3: Performance of BERT in multiple datasets</div><br />
<br />
== Repository ==<br />
<br />
The official repository for this paper is available at <span class="plainlinks">[https://github.com/brightmart/bert_language_understanding "official repository"]</span><br />
<br />
== References ==<br />
[1] Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin.<br />
"Attention Is All You Need". (2017)<br />
<br />
[2] <br />
Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language".(2019)</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=48011From Variational to Deterministic Autoencoders2020-11-30T00:49:52Z<p>Msikarou: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
John Landon Edwards<br />
<br />
== Introduction ==<br />
This paper presents an alternative framework to the stochastic Variational Autoencoders (VAEs) that is deterministic named the Regularized Autoencoders (RAEs) for generative modeling. The goal of VAEs is to learn from a large collection of high-dimensional samples to draw new sample from the inferred population distribution. RAEs hope to achieve the same goal without the drawbacks of VAEs in practice. The advantages of RAEs to VAEs are that they are easier to train and simpler. The paper investigates how forcing an arbitrary prior <math>p(z) </math> within VAEs could be substituted instead with a regularization scheme to the loss function. Furthermore, a generative mechanism for RAEs is proposed utilizing an ex-post density estimation step that can also be applied to existing VAEs. Finally, they conduct an empirical comparison between VAEs and RAEs to demonstrate that the latter are able to generate samples that are comparable or better when applied to domains of images and structured objects.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* the compromise between sample quality and reconstruction quality is poor<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
* learned posterior distribution doesn't match the latent assumption [8]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors note that VAE's introduction of random noise within the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> have a regularization effect because it promotes the learning of a smoother latent space. This motivates their exploration of regularization schemes within an autoencoders loss function which could substitute the VAE's random noise injection. This would allow for the elimination of the variational framework and to circumvent its associated drawbacks.<br />
<br />
Due to the deterministic nature of RAES, it is impossible to sample from <math>p(z)</math> to produce generated samples. The authors provide a solution to this problem by fitting a density estimate of the latent post-training to generate new samples.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their framework and Wasserstein Autoencoders (WAEs) [5] where a deterministic version can be trained. However, the RAEs utilize a different loss function and differs in their implementation of the ex-post density estimation. Additionally, they suggest that Vector Quantized-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAES also adopt ex-post density estimation but implement this through a discrete auto-regressive method. Furthermore, VQ-VAEs utilize a different training loss that is non-differentiable.<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to the existing VAEs framework. Firstly, eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a resigned loss function <math>\mathcal{L}_{RAE}</math>. Finally, it proposes an ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors proposes eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math> resulting in a Encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent variable <math> z </math>.<br />
<br />
The current variational framework of VAEs enforces regularization on the encoder posterior through KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z) + \mathbb{KL}(q_{\phi}(z|x) | p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math> the authors suggest substituting the losses KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution and <math>p(x|z)</math> would be zero almost everywhere. Also as the KL-divergence term previously enforced regularization on the encoder posterior so its plausible that an alternative regularization scheme could impact the quality of sample results. This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The resigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\end{align}<br />
where <math>\lambda</math> and <math>\beta</math> are hyper parameters.<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{x} D_{\theta}(E_\phi(x)) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math> whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of it largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell})\\<br />
\end{align}<br />
<br />
=== Ex-Post Density Estimation ===<br />
Recall that since the autoencoder is no longer stochastic, it may prove to be a challenge to sample from the latent space to generate new samples. However, the author proposes to fit a density estimator <math>q_{\delta}(\mathbf{z})</math> over the trained latent spaces points <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math> to solve this problem. They can then sample using the estimated density to produce decoded samples. The authors note the choice of density estimator here needs to balance a trade-off of expressiveness and simplicity whereby a good fit of the latent points is produced but still allowing for generalization to untrained points. It is noteworthy that, even in VAE where one would sample from the prespecified <math>p(z)</math>, the generative mechanism is not perfect either, as often times the posterior <math>q_{\phi}(z)</math> can depart a lot from <math>p(z)</math> and thus the sampled <math>z</math> might fall into regions that the decoder hasn't seen. Therefore, intuitively the use of an estimated density is not likely to be more compromising than <math>p(z)</math> already is in VAE.<br />
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikonov Regularization , Gradient Penalty, and Spectral Normalization. These correspond with models (RAE-L2) ,(RAE-GP) and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally they consider a model (RAE) where <math>\mathcal{L}_{REC} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REC} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussian (CV-VAE) VAE, a Wassertien Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test sample reconstruction where the French Inception Distance (FID) is computed between a held-out test sample and the networks outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space for each of the models.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
<br />
==== Qualitative evaluation for sample quality on MNIST ====<br />
The following figure shows the qualitative evaluation for sample quality for VAEs, WAEs, and RAEs on MNIST. The first figure in the extreme left depicts the reconstructed samples (top row is ground truth) followed by randomly generated samples in the middle and spherical interpolations between two images at the extreme right. <br />
<br />
[[File:Paper4_ImageModeling.png|Paper4_ImageModeling.png|center]]<br />
<br />
These are remarkable results that show that the lack of an explicitly fixed structure on the latent space of the RAE does not impede interpolation quality. <br />
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST, CIFAR, and CELEBA datasets. Their performance across each metric and each dataset can be seen in '''figure 1'''. For the GMM metric and for each dataset, all RAE variants with regularization schemes outperform the baseline models. Furthermore, for <math>\mathcal{N}</math> the RAE regularized variants outperform the baseline models within the CIFAR and CELEBA datasets. This suggests RAE's can achieve competitive results for generated image quality when compared to existing VAE architectures.<br />
<br />
[[File:Image Gen Res.png|Image Gen Res.png|center]]<br />
<div align="center">'''Figure 1:''' Image Generation Results </div><br />
<br />
=== Modelling Structured Objects ===<br />
====Overview====<br />
The authors evaluate RAEs ability to model the complex structured objects of molecules and arithmetic expressions. They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE)[6] and replace its variational framework with that of an RAE's utilizing the Tikonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment, they are interested in traversing the learned latent space to generate samples for drug molecules and expressions. To evaluate the performance with respect to expressions, they consider <math>\log(1 + MSE)</math> between generated expressions and the true data. To evaluate the performance with respect to molecules they evaluate the water-octanol partition coefficient <math>\log(P)</math> where a higher value corresponds to a generated molecule having a more similar structure to that of a drug molecule. They compare the GRAEs performance on these metrics to those of the GVAE, the constant variance GVAE (GCVVAE), and the CharacterVAE (CVAE) [4] as seen in '''figure 2'''. Additionally, to assess the behavior within the latent space, they report the percentages of expressions and molecules with valid syntax's within the generated samples.<br />
<br />
==== Results ====<br />
Their results displayed in '''figure 2''' show that the VRAE is competitive in its ability to generate samples of structured objects and even outperform the other models with respect to average score for generated expressions. It is notable that for generating molecules although they rank second in average score, it produces the highest percentage of syntactically valid molecules.<br />
[[File:complex obj res.png|center]]<br />
<div align="center">'''Figure 2:''' Complex Object Generation Results </div><br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that deterministic autoencoders are capable of learning a smooth latent space without the requirement of a prior distribution. This allows for the circumvention of drawbacks associated with the variational framework.<br />
By comparing the performance between VAEs and RAE's across the tasks of image and structured object sample generation the authors have demonstrated that RAEs are capable of producing comparable or better sample results.<br />
<br />
== Critiques ==<br />
There is empirical evidence to support the sample quality of RAES is comparable to VAE’s. The Authors are inconclusive in determining how the different variants of regularization schemes affect the RAE’s performance as there was much variation between them for datasets. They do note they opted to use the L2 version in the structured objects experiment because it was the simplest to implement.<br />
There is also empirical evidence that using the ex-post density estimation when applied to existing VAE frameworks improves their sample quality as seen in the image generation experiment, this offers a plausible way to potentially improve existing VAE architectures. My Overall impression of the paper is they provided substantial evidence that a deterministic autoencoder can learn a latent space that is of comparable or better quality than that of a VAE. Although they observe favorable results for their RAE framework, it's still far from conclusive whether RAE will perform better in all data domains. A future comparison I would be interested in seeing is with VQ-VAE’s in the domain of sound generation.<br />
<br />
== Repository ==<br />
<br />
The official repository for this paper is available at <span class="plainlinks">[https://github.com/ParthaEth/Regularized_autoencoders-RAE- "official repository"]</span><br />
<br />
== References ==<br />
<br />
<br />
[1] Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Gómez-Bombarelli, Rafael, Jennifer N., Wei, David, Duvenaud, José Miguel, Hernández-Lobato, Benjamín, Sánchez-Lengeling, Dennis, Sheberla, Jorge, Aguilera-Iparraguirre, Timothy D., Hirzel, Ryan P., Adams, and Alán, Aspuru-Guzik. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules".ACS Central Science 4, no.2 (2018): 268–276.<br />
<br />
[5] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.<br />
<br />
[8] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=48008Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-30T00:46:02Z<p>Msikarou: /* Critique */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Often times machine learning algorithms assume that the training data is the correct representation of the data encountered during deployment. Algorithms generally ignore the chances of receiving little corruption which leads to less robustness and reduction in their accuracy as the models try to fit the noise for the predictions as well. A few corruptions have the potential to reduce the performance of various models like stated in the Hendrycks & Dietterich (2019), showing that the classification error rose from 25% to 62% when some corruption was introduced on the ImageNet test set. <br />
The problem with introducing some corruption is that it encourages the models or the network to memorize the specific corruptions and is, therefore, unable to generalize these corruptions. The paper also provides evidence that networks trained on translation augmentations are highly sensitive to the shifting of pixels.<br />
The paper comes with a new algorithm known as AugMix, a method which achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The paper uses CIFAR 10, CIFAR100, ImageNet datasets for confirming the results. AUGMIX utilizes stochasticity and diverse augmentations, a Jensen-Shannon Divergence consistency loss, and a formulation to mix multiple augmented images to achieve state-of-the-art performance<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix does some basic augmentations techniques. These augmentations are often layered to create a high diversity of augmented images. The loss is calculated using the Jensen-Shannon divergence method.<br />
<br />
[[File:augmix_Milad.gif|1000px|Image: 1000 pixels]]<br />
<br />
<br />
The method proposed by the author can be divided into 3 major sections:<br />
<br />
'''1. Augmentations''': The author uses basic data augmentation chains and the composition of data augmentation operations using AutoAugment. A chain is created like shown in the figure above<br />
<br />
'''2. Mixing''': The resulting images from these augmentation chains are combined by mixing. The author chose to use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. The intuition behind using a Dirichlet distribution is that it allows us to sample coefficients from (0, 1) that sum to 1. Once these images are mixed, the author uses a “skip connection” to combine the result of the augmentation chain and the original image through a second random convex combination sampled from a Beta(α, α) distribution.<br />
<br />
'''3. Jensen-Shannon divergence''': The author augments the original loss function with the Jensen-Shannon divergence loss to enforce stable and consistent output: [[File:loss fn.png]]<br />
<br />
<math>p_\text{orig}</math>, <math>p_\text{augmix1}</math> and <math>p_\text{augmix2}</math> are the posterior distributions of the original input <math>x_\text{orig}</math>, and its augmented variants: <math>x_\text{augmix1}, x_\text{augmix2}</math>, respectively.<br />
<br />
The JS in the above formula means the Jensen-Shannon divergence. It measures the similarities between distributions and is based on KL divergence. However, the Jensen-Shannon divergence is symmetric and can be viewed as a smoothed and normalized version of KL divergence. The JS divergence is particularly helpful when we are comparing multiple distributions.<br />
<br />
[[File:augmix 3.png|1000px|Image: 1000 pixels]]<br />
<br />
where KL means KL Divergence between porig and paugmix<br />
<br />
<br />
The pseudocode for the algorithm:<br />
<br />
[[File:augmix 2.png|1000px|Image: 1000 pixels]]<br />
<br />
For example, the pseudocode can be implemented in '''Python''' as follows:<br />
<syntaxhighlight lang="python"><br />
import numpy as np<br />
def augmix(orig_image, operations, k=3, alpha=1):<br />
aug_image = np.zeros(orig_image.shape)<br />
weights = np.random.dirichlet(np.ones(k)*alpha)<br />
for i in range(k):<br />
op1, op2, op3 = np.random.choice(operations, 3)<br />
chain = np.random.uniform()<br />
if 3*chain < 1:<br />
aug_image += op1(orig_image)<br />
elif 3*chain <2:<br />
aug_image += op2(op1(orig_image))<br />
else:<br />
aug_image += op3(op2(op1(orig_image)))<br />
m = np.random.beta(alpha, alpha)<br />
augmix = m*orig_image + (1-m)*aug_image<br />
return augmix<br />
</syntaxhighlight><br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10: This dataset, along with the CIFAR-100 dataset, are labeled subsets of the 80 million tiny images dataset and were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset is composed of 60000 color images of 32x32 pixels. These images are in 10 classes, with 6000 images per class, 50000 for training, and 10000 for testing. This dataset is used in numerous computer vision journals to compare their algorithms. - https://www.cs.toronto.edu/~kriz/cifar.html<br />
<br />
2. CIFAR 100: The difference between this dataset and the CIFAR-10 dataset is that it includes 100 classes of images with 600 images per each class. These classes are also grouped in 20 super-classes, e.g. the flowers' superclass that contains orchids, poppies, roses, sunflowers, and tulips. - https://www.cs.toronto.edu/~kriz/cifar.html<br />
<br />
3. ImageNet: This dataset aims to obtain at least 1000 images per "synonym set" or "sysnet" in the WordNet hierarchy. WordNet is a large lexical database of English where nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. This dataset is currently home to 1.2 million labelled images. - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The author used CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets which are constructed by adding corruption to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets. These datasets contain smaller perturbations than CIFAR-C and are used to measure the classifier’s prediction stability.<br />
The metrics used for comparison of the models is the error rate of the algorithm. The clean error is achieved by getting the error rates without applying any corruption of the datasets. In the experiment, the author uses 15 corruption techniques hence the error rate after corruption is taken as the average of all the error rates achieved by the specific model.<br />
In order to assess a model’s uncertainty estimates, we measure its miscalibration. The author uses Brier Score or d RMS Calibration Error for this purpose.<br />
<br />
<br />
===Results on CIFAR===<br />
<br />
For CIFAR datasets, 15 corruptions have been applied<br />
<br />
Setup: The author has used three models for comparison:<br />
1.A DenseNet-BC (k = 12, d = 100)<br />
2.A 40-2 Wide ResNet<br />
3.A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, and the DenseNet and ResNeXt require 200 epochs for convergence and weight decay of 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The author has further compared it to other state-of-the-art algorithms used for data augmentation, which can be seen in the above figure. The AugMix algorithm performs the best with a 16.6% lower absolute corruption error. This method only uses ResNeXt on CIFAR-10-C for comparison purposes.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===Results on ImageNet Dataset===<br />
<br />
<br />
[[File:imageNet 1.png|1000px|Image: 1000 pixels]]<br />
<br />
This shows Clean Error, Corruption Error (CE), and mCE values for various methods on ImageNet-C.<br />
The mCE value is computed by averaging across all 15 CE values. AUGMIX reduces corruption error<br />
while improving clean accuracy, and it can be combined with SIN for greater corruption robustness.<br />
== Source Code ==<br />
The source code is available at: https://github.com/google-research/augmix<br />
== Conclusion ==<br />
AUGMIX is a data processing technique that mixes randomly generated augmentations and uses a Jensen-Shannon loss to enforce consistency. The simple-to-implement technique obtains<br />
state-of-the-art performance on CIFAR and ImageNet.AUGMIX seems to enable more reliable models, a necessity for models deployed in safety-critical environments. Using AugMix with the above-specified models performs better and tolerant of corruption.<br />
<br />
<br />
== Critique ==<br />
<br />
Since augmix1 and augmix2 are independent, why did they use JS divergence over the mixture of the three? What happened if they only used <br />
<math><br />
\frac{1}{2} (KL(p_{orig},p_{augmix1})+KL(p_{orig}, p_{augmix2}))<br />
</math>. In other words, what is the priority of the JS over simple KL?</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=47989Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-30T00:28:19Z<p>Msikarou: /* Conclusion */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Often times machine learning algorithms assume that the training data is the correct representation of the data encountered during deployment. Algorithms generally ignore the chances of receiving little corruption which leads to less robustness and reduction in their accuracy as the models try to fit the noise for the predictions as well. A few corruptions have the potential to reduce the performance of various models like stated in the Hendrycks & Dietterich (2019), showing that the classification error rose from 25% to 62% when some corruption was introduced on the ImageNet test set. <br />
The problem with introducing some corruption is that it encourages the models or the network to memorize the specific corruptions and is, therefore, unable to generalize these corruptions. The paper also provides evidence that networks trained on translation augmentations are highly sensitive to the shifting of pixels.<br />
The paper comes with a new algorithm known as AugMix, a method which achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The paper uses CIFAR 10, CIFAR100, ImageNet datasets for confirming the results. AUGMIX utilizes stochasticity and diverse augmentations, a Jensen-Shannon Divergence consistency loss, and a formulation to mix multiple augmented images to achieve state-of-the-art performance<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix does some basic augmentations techniques. These augmentations are often layered to create a high diversity of augmented images. The loss is calculated using the Jensen-Shannon divergence method.<br />
<br />
[[File:augmix_Milad.gif|1000px|Image: 1000 pixels]]<br />
<br />
<br />
The method proposed by the author can be divided into 3 major sections:<br />
<br />
'''1. Augmentations''': The author uses basic data augmentation chains and the composition of data augmentation operations using AutoAugment. A chain is created like shown in the figure above<br />
<br />
'''2. Mixing''': The resulting images from these augmentation chains are combined by mixing. The author chose to use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. The intuition behind using a Dirichlet distribution is that it allows us to sample coefficients from (0, 1) that sum to 1. Once these images are mixed, the author uses a “skip connection” to combine the result of the augmentation chain and the original image through a second random convex combination sampled from a Beta(α, α) distribution.<br />
<br />
'''3. Jensen-Shannon divergence''': The author augments the original loss function with the Jensen-Shannon divergence loss to enforce stable and consistent output: [[File:loss fn.png]]<br />
<br />
<math>p_\text{orig}</math>, <math>p_\text{augmix1}</math> and <math>p_\text{augmix2}</math> are the posterior distributions of the original input <math>x_\text{orig}</math>, and its augmented variants: <math>x_\text{augmix1}, x_\text{augmix2}</math>, respectively.<br />
<br />
The JS in the above formula means the Jensen-Shannon divergence. It measures the similarities between distributions and is based on KL divergence. However, the Jensen-Shannon divergence is symmetric and can be viewed as a smoothed and normalized version of KL divergence. The JS divergence is particularly helpful when we are comparing multiple distributions.<br />
<br />
[[File:augmix 3.png|1000px|Image: 1000 pixels]]<br />
<br />
where KL means KL Divergence between porig and paugmix<br />
<br />
<br />
The pseudocode for the algorithm:<br />
<br />
[[File:augmix 2.png|1000px|Image: 1000 pixels]]<br />
<br />
For example, the pseudocode can be implemented in '''Python''' as follows:<br />
<syntaxhighlight lang="python"><br />
import numpy as np<br />
def augmix(orig_image, operations, k=3, alpha=1):<br />
aug_image = np.zeros(orig_image.shape)<br />
weights = np.random.dirichlet(np.ones(k)*alpha)<br />
for i in range(k):<br />
op1, op2, op3 = np.random.choice(operations, 3)<br />
chain = np.random.uniform()<br />
if 3*chain < 1:<br />
aug_image += op1(orig_image)<br />
elif 3*chain <2:<br />
aug_image += op2(op1(orig_image))<br />
else:<br />
aug_image += op3(op2(op1(orig_image)))<br />
m = np.random.beta(alpha, alpha)<br />
augmix = m*orig_image + (1-m)*aug_image<br />
return augmix<br />
</syntaxhighlight><br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10: This dataset, along with the CIFAR-100 dataset, are labeled subsets of the 80 million tiny images dataset and were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset is composed of 60000 color images of 32x32 pixels. These images are in 10 classes, with 6000 images per class, 50000 for training, and 10000 for testing. This dataset is used in numerous computer vision journals to compare their algorithms. - https://www.cs.toronto.edu/~kriz/cifar.html<br />
<br />
2. CIFAR 100: The difference between this dataset and the CIFAR-10 dataset is that it includes 100 classes of images with 600 images per each class. These classes are also grouped in 20 super-classes, e.g. the flowers' superclass that contains orchids, poppies, roses, sunflowers, and tulips. - https://www.cs.toronto.edu/~kriz/cifar.html<br />
<br />
3. ImageNet: This dataset aims to obtain at least 1000 images per "synonym set" or "sysnet" in the WordNet hierarchy. WordNet is a large lexical database of English where nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. This dataset is currently home to 1.2 million labelled images. - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The author used CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets which are constructed by adding corruption to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets. These datasets contain smaller perturbations than CIFAR-C and are used to measure the classifier’s prediction stability.<br />
The metrics used for comparison of the models is the error rate of the algorithm. The clean error is achieved by getting the error rates without applying any corruption of the datasets. In the experiment, the author uses 15 corruption techniques hence the error rate after corruption is taken as the average of all the error rates achieved by the specific model.<br />
In order to assess a model’s uncertainty estimates, we measure its miscalibration. The author uses Brier Score or d RMS Calibration Error for this purpose.<br />
<br />
<br />
===Results on CIFAR===<br />
<br />
For CIFAR datasets, 15 corruptions have been applied<br />
<br />
Setup: The author has used three models for comparison:<br />
1.A DenseNet-BC (k = 12, d = 100)<br />
2.A 40-2 Wide ResNet<br />
3.A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, and the DenseNet and ResNeXt require 200 epochs for convergence and weight decay of 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The author has further compared it to other state-of-the-art algorithms used for data augmentation, which can be seen in the above figure. The AugMix algorithm performs the best with a 16.6% lower absolute corruption error. This method only uses ResNeXt on CIFAR-10-C for comparison purposes.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===Results on ImageNet Dataset===<br />
<br />
<br />
[[File:imageNet 1.png|1000px|Image: 1000 pixels]]<br />
<br />
This shows Clean Error, Corruption Error (CE), and mCE values for various methods on ImageNet-C.<br />
The mCE value is computed by averaging across all 15 CE values. AUGMIX reduces corruption error<br />
while improving clean accuracy, and it can be combined with SIN for greater corruption robustness.<br />
== Source Code ==<br />
The source code is available at: https://github.com/google-research/augmix<br />
== Conclusion ==<br />
AUGMIX is a data processing technique that mixes randomly generated augmentations and uses a Jensen-Shannon loss to enforce consistency. The simple-to-implement technique obtains<br />
state-of-the-art performance on CIFAR and ImageNet.AUGMIX seems to enable more reliable models, a necessity for models deployed in safety-critical environments. Using AugMix with the above-specified models performs better and tolerant of corruption.<br />
<br />
<br />
== Critique ==<br />
<br />
The question is that is there any specific reason for using Jensen-Shannon divergence instead of a popular divergence like Kullback–Leibler?</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=47987Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-11-30T00:26:18Z<p>Msikarou: /* Approach */</p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Often times machine learning algorithms assume that the training data is the correct representation of the data encountered during deployment. Algorithms generally ignore the chances of receiving little corruption which leads to less robustness and reduction in their accuracy as the models try to fit the noise for the predictions as well. A few corruptions have the potential to reduce the performance of various models like stated in the Hendrycks & Dietterich (2019), showing that the classification error rose from 25% to 62% when some corruption was introduced on the ImageNet test set. <br />
The problem with introducing some corruption is that it encourages the models or the network to memorize the specific corruptions and is, therefore, unable to generalize these corruptions. The paper also provides evidence that networks trained on translation augmentations are highly sensitive to the shifting of pixels.<br />
The paper comes with a new algorithm known as AugMix, a method which achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The paper uses CIFAR 10, CIFAR100, ImageNet datasets for confirming the results. AUGMIX utilizes stochasticity and diverse augmentations, a Jensen-Shannon Divergence consistency loss, and a formulation to mix multiple augmented images to achieve state-of-the-art performance<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix does some basic augmentations techniques. These augmentations are often layered to create a high diversity of augmented images. The loss is calculated using the Jensen-Shannon divergence method.<br />
<br />
[[File:augmix_Milad.gif|1000px|Image: 1000 pixels]]<br />
<br />
<br />
The method proposed by the author can be divided into 3 major sections:<br />
<br />
'''1. Augmentations''': The author uses basic data augmentation chains and the composition of data augmentation operations using AutoAugment. A chain is created like shown in the figure above<br />
<br />
'''2. Mixing''': The resulting images from these augmentation chains are combined by mixing. The author chose to use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. The intuition behind using a Dirichlet distribution is that it allows us to sample coefficients from (0, 1) that sum to 1. Once these images are mixed, the author uses a “skip connection” to combine the result of the augmentation chain and the original image through a second random convex combination sampled from a Beta(α, α) distribution.<br />
<br />
'''3. Jensen-Shannon divergence''': The author augments the original loss function with the Jensen-Shannon divergence loss to enforce stable and consistent output: [[File:loss fn.png]]<br />
<br />
<math>p_\text{orig}</math>, <math>p_\text{augmix1}</math> and <math>p_\text{augmix2}</math> are the posterior distributions of the original input <math>x_\text{orig}</math>, and its augmented variants: <math>x_\text{augmix1}, x_\text{augmix2}</math>, respectively.<br />
<br />
The JS in the above formula means the Jensen-Shannon divergence. It measures the similarities between distributions and is based on KL divergence. However, the Jensen-Shannon divergence is symmetric and can be viewed as a smoothed and normalized version of KL divergence. The JS divergence is particularly helpful when we are comparing multiple distributions.<br />
<br />
[[File:augmix 3.png|1000px|Image: 1000 pixels]]<br />
<br />
where KL means KL Divergence between porig and paugmix<br />
<br />
<br />
The pseudocode for the algorithm:<br />
<br />
[[File:augmix 2.png|1000px|Image: 1000 pixels]]<br />
<br />
For example, the pseudocode can be implemented in '''Python''' as follows:<br />
<syntaxhighlight lang="python"><br />
import numpy as np<br />
def augmix(orig_image, operations, k=3, alpha=1):<br />
aug_image = np.zeros(orig_image.shape)<br />
weights = np.random.dirichlet(np.ones(k)*alpha)<br />
for i in range(k):<br />
op1, op2, op3 = np.random.choice(operations, 3)<br />
chain = np.random.uniform()<br />
if 3*chain < 1:<br />
aug_image += op1(orig_image)<br />
elif 3*chain <2:<br />
aug_image += op2(op1(orig_image))<br />
else:<br />
aug_image += op3(op2(op1(orig_image)))<br />
m = np.random.beta(alpha, alpha)<br />
augmix = m*orig_image + (1-m)*aug_image<br />
return augmix<br />
</syntaxhighlight><br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR 10: This dataset, along with the CIFAR-100 dataset, are labeled subsets of the 80 million tiny images dataset and were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset is composed of 60000 color images of 32x32 pixels. These images are in 10 classes, with 6000 images per class, 50000 for training, and 10000 for testing. This dataset is used in numerous computer vision journals to compare their algorithms. - https://www.cs.toronto.edu/~kriz/cifar.html<br />
<br />
2. CIFAR 100: The difference between this dataset and the CIFAR-10 dataset is that it includes 100 classes of images with 600 images per each class. These classes are also grouped in 20 super-classes, e.g. the flowers' superclass that contains orchids, poppies, roses, sunflowers, and tulips. - https://www.cs.toronto.edu/~kriz/cifar.html<br />
<br />
3. ImageNet: This dataset aims to obtain at least 1000 images per "synonym set" or "sysnet" in the WordNet hierarchy. WordNet is a large lexical database of English where nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. This dataset is currently home to 1.2 million labelled images. - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The author used CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets which are constructed by adding corruption to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets. These datasets contain smaller perturbations than CIFAR-C and are used to measure the classifier’s prediction stability.<br />
The metrics used for comparison of the models is the error rate of the algorithm. The clean error is achieved by getting the error rates without applying any corruption of the datasets. In the experiment, the author uses 15 corruption techniques hence the error rate after corruption is taken as the average of all the error rates achieved by the specific model.<br />
In order to assess a model’s uncertainty estimates, we measure its miscalibration. The author uses Brier Score or d RMS Calibration Error for this purpose.<br />
<br />
<br />
===Results on CIFAR===<br />
<br />
For CIFAR datasets, 15 corruptions have been applied<br />
<br />
Setup: The author has used three models for comparison:<br />
1.A DenseNet-BC (k = 12, d = 100)<br />
2.A 40-2 Wide ResNet<br />
3.A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, and the DenseNet and ResNeXt require 200 epochs for convergence and weight decay of 0.0001 for Mixup and 0.0005 otherwise.<br />
[[File:CIFAR 1.png|1000px|Image: 1000 pixels]]<br />
<br />
The author has further compared it to other state-of-the-art algorithms used for data augmentation, which can be seen in the above figure. The AugMix algorithm performs the best with a 16.6% lower absolute corruption error. This method only uses ResNeXt on CIFAR-10-C for comparison purposes.<br />
<br />
[[File:CIFAR 2.png|1000px|Image: 1000 pixels]]<br />
<br />
===Results on ImageNet Dataset===<br />
<br />
<br />
[[File:imageNet 1.png|1000px|Image: 1000 pixels]]<br />
<br />
This shows Clean Error, Corruption Error (CE), and mCE values for various methods on ImageNet-C.<br />
The mCE value is computed by averaging across all 15 CE values. AUGMIX reduces corruption error<br />
while improving clean accuracy, and it can be combined with SIN for greater corruption robustness.<br />
== Source Code ==<br />
The source code is available at: https://github.com/google-research/augmix<br />
== Conclusion ==<br />
AUGMIX is a data processing technique that mixes randomly generated augmentations and uses a Jensen-Shannon loss to enforce consistency. The simple-to-implement technique obtains<br />
state-of-the-art performance on CIFAR and ImageNet.AUGMIX seems to enable more reliable models, a necessity for models deployed in safety-critical environments. Using AugMix with the above-specified models performs better and tolerant of corruption.</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:augmix_Milad.gif&diff=47983File:augmix Milad.gif2020-11-30T00:23:27Z<p>Msikarou: </p>
<hr />
<div></div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=47772Dense Passage Retrieval for Open-Domain Question Answering2020-11-29T19:57:29Z<p>Msikarou: /* Source Code */</p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader. <br />
<br />
=== Mathematical Formulation ===<br />
Let's assume that we have a collection of <math> D </math> documents <math>d_1, d_2, \cdots, d_D </math> where each <math>d</math> is split into text passages of equal lengths as the basic retrieval units. Then the total number of passages is <math> M </math> in the corpus <math>\mathcal{C} = \{p_1, p_2, \ldots, p_{M}\} </math>. Each passage <math> p_i </math> is composed of a sequence of tokens <math> w^{(i)}_1, w^{(i)}_2, \cdots, w^{(i)}_{|p_i|} </math>. The task is to find a span of tokens <math> w^{(i)}_s, w^{(i)}_{s+1}, \cdots, w^{(i)}_{e} </math> from one of the passages <math> p_i </math> to answer the given question <math>q_i</math> . The corpus can contain millions of documents just to cover a wide variety of domains (e.g. Wikipedia) or even billions of them (e.g. the Web).<br />
The open-domain QA systems define the retriever as a function <math> R:(q,\mathcal{C}) \rightarrow \mathcal{C_F} </math> that takes as input a question <math>q</math> and a corpus <math>\mathcal{C}</math> and then returns a much smaller filter set of texts <math>\mathcal{C_F} \subset \mathcal{C}</math> , where <math>|\mathcal{C_F}| = k \ll |\mathcal{C}|</math>. The retriever can be evaluated by '''top-k retrieval accuracy''' that represents the fraction of relevant answers contained in <math>\mathcal{C_F}</math>.<br />
<br />
= Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
During the inference time, they used FAISS (Johnson et al., 2017) for the similarity search, which can be efficiently applied on billions of dense vectors<br />
<br />
== Training ==<br />
The goal in training the encoders is to create a vector space where the relevant pairs of question and passages (positive passages) have smaller distance than the irrelevant ones (negative passages) which is a metric learning problem.<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages.<br />
\begin{eqnarray}<br />
&& L(q_i, p^+_i, p^-_{i,1}, \cdots, p^-_{i,n}) \label{eq:training} \\<br />
&=& -\log \frac{ e^{\mathrm{sim}(q_i, p_i^+)} }{e^{\mathrm{sim}(q_i, p_i^+)} + \sum_{j=1}^n{e^{\mathrm{sim}(q_i, p^-_{i,j})}}}. \nonumber<br />
\end{eqnarray}<br />
While positive passage selection is simple, where the passage containing the answer is selected, negative passage selection is less explicit. The authors experimented with three types of negative passages: (1) Random passage from the corpus; (2) false positive passages returned by BM25; (3) Gold positive passages from the training set — i.e., a positive passage for one query is considered as a negative passage for another query. The authors got the best model by using gold positive passages from the same batch as negatives. This trick is called in-batch negatives. Assume there are B pairs of (query <math>q_i</math>, positive passage <math>p_i</math>) in a mini-batch, then the negative passages for query <math>q_i</math> are the passages <math>p_i</math> when <math>j</math> is not equal to <math>i</math>.<br />
<br />
= Experimental Setup =<br />
The authors pre-processed Wikipedia documents, i.e. they removed semi-structured data, such as tables, info- boxes, lists, as well as the disambiguation pages and then splitted each document into passages of length 100 words. These passages form a candidate pool. The five QA datasets described below were used to build the train data. <br />
# '''Natural Questions (NQ)''' (Kwiatkowski et al., 2019): This was designed for the end-end question-answering tasks where the questions are Google search queries and the answers are annotated texts from the Wikipedia.<br />
# '''TriviaQA''' (Joshi et al., 2017): It is a collection of trivia questions and answers scraped from the Web.<br />
# '''WebQuestions (WQ)''' (Berant et al., 2013): It consists of questions selected using Google Search API where the answers are entities in Freebase<br />
# '''CuratedTREC (TREC)''' (Baudisˇ and Sˇedivy`, 2015): It is intended for the open-domain QA from unstructured corpora. It consists of TREC QA tracks and questions from other Web sources.<br />
# '''SQuAD v1.1''' (Rajpurkar et al., 2016): It is a popular benchmark dataset for reading comprehension. The annotators were given a paragraph from the Wikipedia and were supposed to write questions which answers could be found in the given text.<br />
To build the training data, the authors match each question in the five datasets with a passage that contains the correct answer. The dataset statistics are summarized below. <br />
<br />
[[File: DPR_datasets.png | 600px | center]]<br />
<br />
= Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. The DPR performance is evaluated in terms of the retrieval accuracy, ablation study, qualitative analysis, and run-time efficiency.<br />
<br />
== Main Results ==<br />
<br />
The table below compares the top-k (for k=20 or k=100) accuracies of different retrieval systems on various popular QA datasets. Top-k accuracy is the percentage of examples in the test set for which the correct outcome occurs in the k most likely outcomes as predicted by the network. As can be seen, DPR consistently outperforms BM25 on all datasets, with the exception of SQuAD. Additionally, DPR tends to perform particularly well when a smaller k value in chosen. The authors speculated that the lower performance of DPR on SQuAD was for two reasons inherent to the dataset itself. First, the SQuAD dataset has high lexical overlap between passages and questions. Second, the dataset is biased because it is collected from a small number of Wikipedia articles - a point which has been argued by other researchers as well.<br />
<br />
[[ File: retrieval_accuracy.png | 800px]]<br />
<br />
== Ablation Study on Model Training ==<br />
The authors further analyzed how different training options would influence the model performance. The five training options they studied are (1) Sample efficiency, (2) In-batch negative training, (3) Impact of gold passages, (4) Similarity and loss, and (5) Cross-dataset generalization. <br />
<br />
(1) '''Sample efficiency''' <br />
<br />
The authors examined how many training examples were needed to achieve good performance. The study showed that DPR trained with 1k examples already outperformed BM25. With more training data, DPR performs better.<br />
<br />
[[File: sample_efficiency.png | 500px | center ]]<br />
<br />
(2) '''In-batch negative training'''<br />
<br />
Three training schemes are evaluated on the development dataset. The first scheme, which is the standard 1-to-N training setting, is to pair each query with one positive passage and n negative passages. As mentioned before, there are three ways to select negative passages: random, BM25, and gold. The results showed that in this setting, the choices of negative passages did not have strong impact on the model performance. The top block of the table below shows the retrieval accuracy in the 1-to-N setting. The second scheme, which is called in-batch negative setting, is to use positive passages for other queries in the same batch as negatives. The middle block shows the in-batch negative training results. The performance is significantly improved compared to the first setting. The last scheme is to augment in-batch negative with addition hard negative passages that are high ranked by BM25, but do not contain the correct answer. The bottom block shows the result for this setting, and the authors found that adding one addition BM25 hard negatives works the best.<br />
<br />
[[File: training_scheme.png | 500px | center]]<br />
<br />
(3) '''Similarity and loss'''<br />
In this paper, the similarity between a query and a passage is measured by the dot product of the two vectors, and the loss function is defined as the negative log likelihood of the positive passages. The authors experimented with other similarity functions such as L2-norm and cosine distance, and other loss functions such as triplet loss. The results showed these options didn't improve the model performance much.<br />
<br />
(4) '''Cross-dataset generalization'''<br />
Cross-dataset generalization studies if a model trained on one dataset can perform well on other unseen datasets. The authors trained DPR on Natural Questions dataset and tested it on WebQuestions and CuratedTREC. The result showed DPR generalized well, with only 3~5 points loss.<br />
<br />
== Qualitative Analysis ==<br />
Since BM25 is a bag-of-words method which ranks passages based on term-frequency, it's good at exact keywords and phrase matching, while DPR can capture lexical variants and semantic relationships. Generally DPR outperforms BM25 on the test sets. <br />
<br />
<br />
== Run-time Efficiency ==<br />
The time required for generating dense embeddings and indexing passages is long. It took the authors 8.8 hours to encode 21-million passages on 8 GPUs, and 8.5 hours to index 21-million passages on a singe server. Conversely, building inverted index for 21-million passages only takes about 30 minutes. However, after the pre-processing is done, DPR can process 995 queries per second, while BM25 processes 23.7 queries per second.<br />
<br />
<br />
= Experiments: Question Answering =<br />
The authors evaluated DPR on end-to-end QA systems (i.e., retriever + neural reader). The results showed that higher retriever accuracy typically leads to better final QA results, and the passages retrieved by DPR are more likely to contain the correct answers. As shown in the table below, QA systems using DPR generally perform better, except for SQuAD.<br />
<br />
[[File: QA.png | 600px | center]]<br />
<br />
=== Reader for End-to-End ===<br />
<br />
The reader would score the subset of <math>k</math> passages which were retrieved using DPR. Next, the reader would extract an answer span from each passage and assign a span score. The best span from the passage with the highest selection score is chosen as the final answer. To calculate these scores linear layers followed by a softmax are applied on top of the BERT output as follows:<br />
<br />
<math>\textbf{P}_{start,i}= softmax(\textbf{P}_i \textbf{w}_{start}) \in \mathbb{R}^L </math> where <math> L </math> is the number of words in the passage. (1)<br />
<br />
<math>\textbf{P}_{end,i}= softmax(\textbf{P}_i \textbf{w}_{start}) \in \mathbb{R}^L </math> where <math> L </math> is the number of words in the passage. (2)<br />
<br />
<math>\textbf{P}_{selected}= softmax(\hat{\textbf{P}}^T_i \textbf{w}_{selected}) \in \mathbb{R}^k </math> where <math> k </math> is the number of passages. (3)<br />
<br />
Here <math> \textbf{P}_i \in \mathbb{R}^{L \times h}</math> used in (1) and (2), <math>i</math> is one of the <math> k </math> passages and <math> h </math> is the hidden layer dimension of BERT's output. Additionally, <math> \hat{\textbf{P}} = [\textbf{P}^{[CLS]}_1,...,\textbf{P}^{[CLS]}_k] \in \mathbb{R}^{h \times k}</math>. Here <math> \textbf{w}_{start},\textbf{w}_{end},\textbf{w}_{selected} </math> are learnable vectors. The span score for <math> s </math>-th starting word to the <math> t</math>-th ending word of the <math> i </math>-th passage then can be calculated as <math> P_{start,i}(s) \times P_{end,i}(t) </math> where <math> s,t </math> are used to select the corresponding index of <math> \textbf{P}_{start,i}, \textbf{P}_{end,i} </math> respectively. The passage selection score for the <math> i</math>-th passage can be computed similarly, by indexing the <math>i</math>-th element of <math>\textbf{P}_{selected}</math>.<br />
<br />
= Source Code =<br />
<br />
<br />
The official code for this paper is freely available at <span class="plainlinks">[https://github.com/facebookresearch/DPR "official repository"]</span><br />
<br />
There are also other repositories here: 1- <span class="plainlinks">[https://github.com/huggingface/transformers transformers]</span> 2- <span class="plainlinks">[https://github.com/huggingface/transformers haystack]</span><br />
<br />
= Conclusion =<br />
In conclusion, this paper proposed a dense retrieval method which can generally outperform traditional bag-of-words methods in open domain question answering tasks. Dense retrieval methods make use of pre-training language model Bert and achieved state-of-art performance on many QA datasets. Dense retrieval methods are good at capturing word variants and semantic relationships, but are relatively weak at capturing exact keyword match. This paper also covers a few training techniques that are required to successfully train a dense retriever. Overall, learning dense representations for first stage retrieval can potentially improve QA system performance, and has received more attention. <br />
= Critiques =<br />
<br />
The bag of words method is an old out-dated retriever approach. The paper could have exploited results from other state-of-the-art algorithms (e.g Golden retriever) and compared the proposed method with them.<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_Classification&diff=46494Extreme Multi-label Text Classification2020-11-26T01:35:29Z<p>Msikarou: /* APLC-XLNet */</p>
<hr />
<div>== Presented By ==<br />
Mohan Wu<br />
<br />
== Introduction ==<br />
In this paper, the authors are interested a field of problems called extreme classification. These problems involve training a classifier to give the most relevant tags for any given text; the difficulties arise from the fact that the label set is so large so that most models would fail. The authors propose a new model called APLC-XLNet which fine-tunes the generalized autoregressive pre-trained model (XLNet) by using Adaptive Probabilistic Label Clusters (APLC) to calculate cross-entropy loss. This method takes advantage of unbalanced label distributions by forming clusters to reduce training time. The authors experimented on five different datasets and showed that their method far outweighs the existing state-of-the-art models.<br />
<br />
== Motivation ==<br />
Extreme multi-label text classification (XMTC) has applications in many recent problems such as providing word representations of a large vocabulary [1], tagging Wikipedia with relevant labels [2], and giving product descriptions for search advertisements [3]. The authors are motivated by the shortcomings of traditional methods in the creation of XMTC. For example, one such method of classifying text is the bag-of-words (BOW) approach where a vector represents the frequency of a word in a corpus. However, BOW does not consider the location of the words so it cannot determine context and semantics. Motivated by the success of transfer learning in a wide range of natural language processing (NLP) problems, the authors propose to adapt XLNet [4] to the XMTC problem. The final challenge is the nature of the labeling distribution can be very sparse for some labels. The authors solve this problem by combining the Probabilistic Label Tree [5] method and the Adaptive Softmax [6] to propose APLC.<br />
<br />
== Related Work ==<br />
Two approaches have been proposed to solve the XMTC problem: traditional BOW techniques, and modern deep learning models.<br />
<br />
=== BOW Approaches ===<br />
Intuitively, researchers can apply the one-vs-all approach in which they can fit a classifier for each label and thus XMTC reduces to a binary classification problem. This technique has been shown to achieve high accuracy; however, due to a large number of labels, this approach is very computationally expensive. There have been some techniques to reduce the complexity by pruning weights (PDSparse) to induce sparsity but still, this method is quite expensive. PPDSparse was introduced to parallelize PDSparse in a large-scaled distributed setting. Additionally, DiSMEC was introduced as another extension to PDSparse that allows for a distributed and parallel training mechanism which can be done efficiently by a novel negative sampling technique in addition to the pruning of spurious weights. Another approach is to simply apply a dimensional reduction technique on the label space. However, doing so has shown to have serious negative effects on the prediction accuracy due to loss of information in the compression phase. Finally, another approach is to use a tree to partition labels into groups based on similarity. This approach has shown to be quite fast but unfortunately, due to the problems with BOW methods, the accuracy is poor.<br />
<br />
=== Deep Learning Approaches ===<br />
Unlike BOW approaches, deep learning can learn dense representations of the corpus using context and semantics. One such example is X-BERT[7] which divides XTMC problems into 3 steps. First, it partitions the label set into clusters based on similarity. Next, it fits a BERT model on the label clusters for the given corpus. Finally, it trains simple linear classifiers to rank the labels in each cluster. Since there are elements of traditional techniques in X-BERT, namely, the clustering step and the linear classifier step, the authors propose to improve upon this approach.<br />
<br />
== APLC-XLNet ==<br />
APLC-XLNet consists of three parts: the pre-trained XLNet-Base as the base, the APLC output layer, and a fully connected hidden layer connecting the pooling layer of XLNet as the output layer, as it can be seen in figure 1. To recap, XLNet [8] is a generalized autoregressive pretraining language model based upon capturing bidirectional contexts through maximizing the expected likelihood over all permutations of the factorization<br />
the order which can be expressed as follows:<br />
<br />
\begin{equation}<br />
\underset{\theta}{\max}{ E_{z \sim Z_T}[\sum_{t=1}^{T} \log {p_{\theta}(x_{z_{t}}|x_{z<t})}] }<br />
\end{equation}<br />
<br />
where <math> \theta </math> denotes the parameters of the model, <math> Z_T </math> is the set of all possible permutations of the sequence with length T,<br />
<math> z_t </math> is the <math> t </math>-th element and <math> z < t </math> deontes the preceding <math> t-1 </math> elements in a permutation <math> z</math>.<br />
<br />
One major challenge in XTMC problems is that most data fall into a small group of labels. To tackle this challenge, the authors propose partitioning the label set into one head cluster, <math> V_h </math>, and many tail clusters, <math> V_1 \cup \ldots \cup V_K </math>. The head cluster contains the most popular labels while the tail clusters contain the rest of the labels. The clusters are then inserted in a 2-level tree where the root node is the head cluster and the leaves are the tail clusters. Using this architecture improves computation times significantly since most of the time the data stops at the root node.<br />
<br />
[[File:Capture1111.JPG |center|600px]]<br />
<br />
<div align="center">Figure 1: Architecture of the proposed APLC-XLNet model. V denotes the label cluster in APLC.</div><br />
<br />
The authors define the probability of each label as follows:<br />
<br />
\begin{equation}<br />
p(y_{ij} | x) = <br />
\begin{cases} <br />
p(y_{ij}|x) & \text{if } y_{ij} \in V_h \\<br />
p(V_t|x)p(y_{ij}|V_t,x) & \text{if } y_{ij} \in V_t<br />
\end{cases}<br />
\end{equation}<br />
<br />
where <math> x </math> is the feature of a given sample, <math> y_{ij} </math> is the j-th label in the i-th sample, and <math> V_t </math> is the t-th tail cluster. Let <math> Y_i </math> be the set of labels for the i-th sample, and define <math> L_i = |Y_i| </math>. The authors propose an intuitive objective loss function for multi-label classification:<br />
<br />
\begin{equation}<br />
J(\theta) = -\frac{1}{\sum_{i=1}^N L_i} \sum_{i=1}^N \sum_{j \in Y_i} (y_{ij} logp(y_{ij}) + (1 - y_{ij}) log(1- p(y_{ij}))<br />
\end{equation}<br />
where <math>N</math> is the number of samples, <math>p(y_{ij})</math> is defined above and <math> y_{ij} \in \{0, 1\} </math>.<br />
<br />
The number of parameters in this model is given by:<br />
\begin{equation}<br />
N_{par} = d(l_h + K) + \sum_{i=1}^K \frac{d}{q^i}(d+l_i)<br />
\end{equation}<br />
where <math> d </math> is the dimension of the hidden state of <math> V_h </math>, <math> q </math> is a decay variable, <math> l_h = |V_h| </math> and <math> l_i = |V_i| </math>. Furthermore, the computational cost can be expressed as follows:<br />
\begin{align}<br />
C &= C_h + \sum_{i=1}^K C_i \\<br />
&= O(N_b d(l_h + K)) + O(\sum_{i=1}^K p_i N_b \frac{d}{q^i}(l_i + d))<br />
\end{align}<br />
where <math> N_b </math> is the batch size.<br />
<br />
=== Training APLC-XLNet ===<br />
Training APLC-XLNet essentially boils down to training its three parts. The authors suggest using discriminative fine-tuning method[9] to train the model entirely while assigning different learning rates to each part. Since XLNet is pretrained, the learning rate, <math> \eta_x </math>, should be small, while the output layer is specific to this type of problem so its learning rate, <math> \eta_a </math>, should be large. For the connecting hidden layer, the authors chose a learning rate, <math> \eta_h </math>, such that <math> \eta_x < \eta_h < \eta_a </math>. For each of the learning rates, the authors suggest a slanted triangular learning schedule[9] defined as:<br />
\begin{equation}<br />
\eta =<br />
\begin{cases} <br />
\eta_0 \frac{t}{t_w} & \text{if } t \leq t_w \\<br />
\eta_0 \frac{t_a - t}{t_a - t_w} & \text{if } t > t_w<br />
\end{cases}<br />
\end{equation}<br />
where <math> \eta_0 </math> is the starting learning rate, <math> t </math> is the current step, <math> t_w </math> is the chosen warm-up threshold and <math> t_a </math> is the total number of steps. The objective here is to motivate the model to converge quickly to the suitable space at the beginning and then refine the parameters. Learning rates are first increased linearly, and then decayed gradually according to the strategy.<br />
<br />
== Results ==<br />
The authors tested the APLC-XLNet model in several benchmark datasets against current state-of-the-art models. The evaluation metric, P@k is defined as:<br />
\begin{equation}<br />
P@k = \frac{1}{k} \sum_{i \in rank_k(\hat{y})} y_i<br />
\end{equation}<br />
where <math> rank_k(\hat{y}) </math> is the top k ranked probability in the prediction vector, <math> \hat{y} </math>.<br />
<br />
[[File:Paper.PNG|1000px|]]<br />
<br />
To help with tuning the model in the number of clusters and different partitions, the authors experimented on two different datasets: EURLex and Wiki10.<br />
<br />
[[File:XTMC2.PNG|1000px|]]<br />
<br />
The three different partitions in the second graph the authors used were (0.7, 0.2, 0.1), (0.33, 0.33, 0.34), and (0.1, 0.2, 0.7) where 3 clusters were fixed.<br />
<br />
== Conclusion ==<br />
The authors have proposed a new deep learning approach to solve the XMTC problem based on XLNet, namely, APLC-XLNet. APLC-XLNet consists of three parts: the pretrained XLNet that takes the input of the text, a connecting hidden layer, and finally an APLC output layer to give the rankings of relevant labels. Their experiments show that APLC-XLNet has better results in several benchmark datasets over the current state-of-the-art models.<br />
<br />
== Critques ==<br />
The authors chose to use the same architecture for every dataset. The model does not achieve state-of-the-art performance on the larger datasets. Perhaps, a more complex model in the second part of the model could help achieve better results. The authors also put a lot of effort in explaining the model complexity for APLC-XLNet but does not compare it other state-of-the-art models. A table of model parameters and complexity for each model could be helpful in explaining why their techniques are efficient.<br />
<br />
== References ==<br />
[1] Mikolov, T., Kombrink, S., Burget, L., Cernock ˇ y, J., and `<br />
Khudanpur, S. Extensions of recurrent neural network<br />
language model. In 2011 IEEE International Conference<br />
on Acoustics, Speech and Signal Processing (ICASSP),<br />
pp. 5528–5531. IEEE, 2011. <br />
<br />
[2] Dekel, O. and Shamir, O. Multiclass-multilabel classification with more classes than examples. In Proceedings<br />
of the Thirteenth International Conference on Artificial<br />
Intelligence and Statistics, pp. 137–144, 2010. <br />
<br />
[3] Jain, H., Prabhu, Y., and Varma, M. Extreme multi-label loss<br />
functions for recommendation, tagging, ranking & other<br />
missing label applications. In Proceedings of the 22nd<br />
ACM SIGKDD International Conference on Knowledge<br />
Discovery and Data Mining, pp. 935–944. ACM, 2016. <br />
<br />
[4] Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li,<br />
M., and Lin, J. End-to-end open-domain question answering with BERTserini. In NAACL-HLT (Demonstrations),<br />
2019a. <br />
<br />
[5] Jasinska, K., Dembczynski, K., Busa-Fekete, R.,<br />
Pfannschmidt, K., Klerx, T., and Hullermeier, E. Extreme f-measure maximization using sparse probability<br />
estimates. In International Conference on Machine Learning, pp. 1435–1444, 2016. <br />
<br />
[6] Grave, E., Joulin, A., Cisse, M., J ´ egou, H., et al. Effi- ´<br />
cient softmax approximation for gpus. In Proceedings of<br />
the 34th International Conference on Machine LearningVolume 70, pp. 1302–1310. JMLR.org, 2017 <br />
<br />
[7] Wei-Cheng, C., Hsiang-Fu, Y., Kai, Z., Yiming, Y., and<br />
Inderjit, D. X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder Representations from<br />
Transformers. In NeurIPS Science Meets Engineering of<br />
Deep Learning Workshop, 2019.<br />
<br />
[8] Yang, Zhilin, et al. "Xlnet: Generalized autoregressive pretraining for language understanding." <br />
Advances in neural information processing systems. 2019.<br />
<br />
[9] Howard, J. and Ruder, S. Universal language model finetuning for text classification. In Proceedings of the 56th<br />
Annual Meeting of the Association for Computational<br />
Linguistics (Volume 1: Long Papers), pp. 328–339, 2018</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_Classification&diff=46474Extreme Multi-label Text Classification2020-11-25T21:14:14Z<p>Msikarou: /* References */</p>
<hr />
<div>== Presented By ==<br />
Mohan Wu<br />
<br />
== Introduction ==<br />
In this paper, the authors are interested a field of problems called extreme classification. These problems involve training a classifier to give the most relevant tags for any given text; the difficulties arise from the fact that the label set is so large so that most models would fail. The authors propose a new model called APLC-XLNet which fine-tunes the generalized autoregressive pre-trained model (XLNet) by using Adaptive Probabilistic Label Clusters (APLC) to calculate cross-entropy loss. This method takes advantage of unbalanced label distributions by forming clusters to reduce training time. The authors experimented on five different datasets and showed that their method far outweighs the existing state-of-the-art models.<br />
<br />
== Motivation ==<br />
Extreme multi-label text classification (XMTC) has applications in many recent problems such as providing word representations of a large vocabulary [1], tagging Wikipedia with relevant labels [2], and giving product descriptions for search advertisements [3]. The authors are motivated by the shortcomings of traditional methods in the creation of XMTC. For example, one such method of classifying text is the bag-of-words (BOW) approach where a vector represents the frequency of a word in a corpus. However, BOW does not consider the location of the words so it cannot determine context and semantics. Motivated by the success of transfer learning in a wide range of natural language processing (NLP) problems, the authors propose to adapt XLNet [4] to the XMTC problem. The final challenge is the nature of the labeling distribution can be very sparse for some labels. The authors solve this problem by combining the Probabilistic Label Tree [5] method and the Adaptive Softmax [6] to propose APLC.<br />
<br />
== Related Work ==<br />
Two approaches have been proposed to solve the XMTC problem: traditional BOW techniques, and modern deep learning models.<br />
<br />
=== BOW Approaches ===<br />
Intuitively, researchers can apply the one-vs-all approach in which they can fit a classifier for each label and thus XMTC reduces to a binary classification problem. This technique has been shown to achieve high accuracy; however, due to a large number of labels, this approach is very computationally expensive. There have been some techniques to reduce the complexity by pruning weights (PDSparse) to induce sparsity but still, this method is quite expensive. PPDSparse was introduced to parallelize PDSparse in a large-scaled distributed setting. Additionally, DiSMEC was introduced as another extension to PDSparse that allows for a distributed and parallel training mechanism which can be done efficiently by a novel negative sampling technique in addition to the pruning of spurious weights. Another approach is to simply apply a dimensional reduction technique on the label space. However, doing so has shown to have serious negative effects on the prediction accuracy due to loss of information in the compression phase. Finally, another approach is to use a tree to partition labels into groups based on similarity. This approach has shown to be quite fast but unfortunately, due to the problems with BOW methods, the accuracy is poor.<br />
<br />
=== Deep Learning Approaches ===<br />
Unlike BOW approaches, deep learning can learn dense representations of the corpus using context and semantics. One such example is X-BERT[7] which divides XTMC problems into 3 steps. First, it partitions the label set into clusters based on similarity. Next, it fits a BERT model on the label clusters for the given corpus. Finally, it trains simple linear classifiers to rank the labels in each cluster. Since there are elements of traditional techniques in X-BERT, namely, the clustering step and the linear classifier step, the authors propose to improve upon this approach.<br />
<br />
== APLC-XLNet ==<br />
APLC-XLNet consists of three parts: the pre-trained XLNet-Base as the base, the APLC output layer, and a fully connected hidden layer connecting the pooling layer of XLNet as the output layer, as it can be seen in figure 1. To recap, XLNet [8] is a generalized autoregressive pretraining language model based upon capturing bidirectional contexts through maximizing the expected likelihood over all permutations of the factorization<br />
the order which can be expressed as follows:<br />
<br />
\begin{equation}<br />
\underset{\theta}{\max}{ E_{z \sim Z_T}[\sum_{t=1}^{T} \log {p_{\theta}(x_{z_{t}}|x_{z<t})}] }<br />
\end{equation}<br />
<br />
One major challenge in XTMC problems is that most data fall into a small group of labels. To tackle this challenge, the authors propose partitioning the label set into one head cluster, <math> V_h </math>, and many tail clusters, <math> V_1 \cup \ldots \cup V_K </math>. The head cluster contains the most popular labels while the tail clusters contain the rest of the labels. The clusters are then inserted in a 2-level tree where the root node is the head cluster and the leaves are the tail clusters. Using this architecture improves computation times significantly since most of the time the data stops at the root node.<br />
<br />
[[File:Capture1111.JPG |center|600px]]<br />
<br />
<div align="center">Figure 1: Architecture of the proposed APLC-XLNet model. V denotes the label cluster in APLC.</div><br />
<br />
The authors define the probability of each label as follows:<br />
<br />
\begin{equation}<br />
p(y_{ij} | x) = <br />
\begin{cases} <br />
p(y_{ij}|x) & \text{if } y_{ij} \in V_h \\<br />
p(V_t|x)p(y_{ij}|V_t,x) & \text{if } y_{ij} \in V_t<br />
\end{cases}<br />
\end{equation}<br />
<br />
where <math> x </math> is the feature of a given sample, <math> y_{ij} </math> is the j-th label in the i-th sample, and <math> V_t </math> is the t-th tail cluster. Let <math> Y_i </math> be the set of labels for the i-th sample, and define <math> L_i = |Y_i| </math>. The authors propose an intuitive objective loss function for multi-label classification:<br />
<br />
\begin{equation}<br />
J(\theta) = -\frac{1}{\sum_{i=1}^N L_i} \sum_{i=1}^N \sum_{j \in Y_i} (y_{ij} logp(y_{ij}) + (1 - y_{ij}) log(1- p(y_{ij}))<br />
\end{equation}<br />
where <math>N</math> is the number of samples, <math>p(y_{ij})</math> is defined above and <math> y_{ij} \in \{0, 1\} </math>.<br />
<br />
The number of parameters in this model is given by:<br />
\begin{equation}<br />
N_{par} = d(l_h + K) + \sum_{i=1}^K \frac{d}{q^i}(d+l_i)<br />
\end{equation}<br />
where <math> d </math> is the dimension of the hidden state of <math> V_h </math>, <math> q </math> is a decay variable, <math> l_h = |V_h| </math> and <math> l_i = |V_i| </math>. Furthermore, the computational cost can be expressed as follows:<br />
\begin{align}<br />
C &= C_h + \sum_{i=1}^K C_i \\<br />
&= O(N_b d(l_h + K)) + O(\sum_{i=1}^K p_i N_b \frac{d}{q^i}(l_i + d))<br />
\end{align}<br />
where <math> N_b </math> is the batch size.<br />
<br />
=== Training APLC-XLNet ===<br />
Training APLC-XLNet essentially boils down to training its three parts. The authors suggest using discriminative fine-tuning method[9] to train the model entirely while assigning different learning rates to each part. Since XLNet is pretrained, the learning rate, <math> \eta_x </math>, should be small, while the output layer is specific to this type of problem so its learning rate, <math> \eta_a </math>, should be large. For the connecting hidden layer, the authors chose a learning rate, <math> \eta_h </math>, such that <math> \eta_x < \eta_h < \eta_a </math>. For each of the learning rates, the authors suggest a slanted triangular learning schedule[9] defined as:<br />
\begin{equation}<br />
\eta =<br />
\begin{cases} <br />
\eta_0 \frac{t}{t_w} & \text{if } t \leq t_w \\<br />
\eta_0 \frac{t_a - t}{t_a - t_w} & \text{if } t > t_w<br />
\end{cases}<br />
\end{equation}<br />
where <math> \eta_0 </math> is the starting learning rate, <math> t </math> is the current step, <math> t_w </math> is the chosen warm-up threshold and <math> t_a </math> is the total number of steps. The objective here is to motivate the model to converge quickly to the suitable space at the beginning and then refine the parameters. Learning rates are first increased linearly, and then decayed gradually according to the strategy.<br />
<br />
== Results ==<br />
The authors tested the APLC-XLNet model in several benchmark datasets against current state-of-the-art models. The evaluation metric, P@k is defined as:<br />
\begin{equation}<br />
P@k = \frac{1}{k} \sum_{i \in rank_k(\hat{y})} y_i<br />
\end{equation}<br />
where <math> rank_k(\hat{y}) </math> is the top k ranked probability in the prediction vector, <math> \hat{y} </math>.<br />
<br />
[[File:Paper.PNG|1000px|]]<br />
<br />
To help with tuning the model in the number of clusters and different partitions, the authors experimented on two different datasets: EURLex and Wiki10.<br />
<br />
[[File:XTMC2.PNG|1000px|]]<br />
<br />
The three different partitions in the second graph the authors used were (0.7, 0.2, 0.1), (0.33, 0.33, 0.34), and (0.1, 0.2, 0.7) where 3 clusters were fixed.<br />
<br />
== Conclusion ==<br />
The authors have proposed a new deep learning approach to solve the XMTC problem based on XLNet, namely, APLC-XLNet. APLC-XLNet consists of three parts: the pretrained XLNet that takes the input of the text, a connecting hidden layer, and finally an APLC output layer to give the rankings of relevant labels. Their experiments show that APLC-XLNet has better results in several benchmark datasets over the current state-of-the-art models.<br />
<br />
== Critques ==<br />
The authors chose to use the same architecture for every dataset. The model does not achieve state-of-the-art performance on the larger datasets. Perhaps, a more complex model in the second part of the model could help achieve better results. The authors also put a lot of effort in explaining the model complexity for APLC-XLNet but does not compare it other state-of-the-art models. A table of model parameters and complexity for each model could be helpful in explaining why their techniques are efficient.<br />
<br />
== References ==<br />
[1] Mikolov, T., Kombrink, S., Burget, L., Cernock ˇ y, J., and `<br />
Khudanpur, S. Extensions of recurrent neural network<br />
language model. In 2011 IEEE International Conference<br />
on Acoustics, Speech and Signal Processing (ICASSP),<br />
pp. 5528–5531. IEEE, 2011. <br />
<br />
[2] Dekel, O. and Shamir, O. Multiclass-multilabel classification with more classes than examples. In Proceedings<br />
of the Thirteenth International Conference on Artificial<br />
Intelligence and Statistics, pp. 137–144, 2010. <br />
<br />
[3] Jain, H., Prabhu, Y., and Varma, M. Extreme multi-label loss<br />
functions for recommendation, tagging, ranking & other<br />
missing label applications. In Proceedings of the 22nd<br />
ACM SIGKDD International Conference on Knowledge<br />
Discovery and Data Mining, pp. 935–944. ACM, 2016. <br />
<br />
[4] Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li,<br />
M., and Lin, J. End-to-end open-domain question answering with BERTserini. In NAACL-HLT (Demonstrations),<br />
2019a. <br />
<br />
[5] Jasinska, K., Dembczynski, K., Busa-Fekete, R.,<br />
Pfannschmidt, K., Klerx, T., and Hullermeier, E. Extreme f-measure maximization using sparse probability<br />
estimates. In International Conference on Machine Learning, pp. 1435–1444, 2016. <br />
<br />
[6] Grave, E., Joulin, A., Cisse, M., J ´ egou, H., et al. Effi- ´<br />
cient softmax approximation for gpus. In Proceedings of<br />
the 34th International Conference on Machine LearningVolume 70, pp. 1302–1310. JMLR.org, 2017 <br />
<br />
[7] Wei-Cheng, C., Hsiang-Fu, Y., Kai, Z., Yiming, Y., and<br />
Inderjit, D. X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder Representations from<br />
Transformers. In NeurIPS Science Meets Engineering of<br />
Deep Learning Workshop, 2019.<br />
<br />
[8] Yang, Zhilin, et al. "Xlnet: Generalized autoregressive pretraining for language understanding." <br />
Advances in neural information processing systems. 2019.<br />
<br />
[9] Howard, J. and Ruder, S. Universal language model finetuning for text classification. In Proceedings of the 56th<br />
Annual Meeting of the Association for Computational<br />
Linguistics (Volume 1: Long Papers), pp. 328–339, 2018</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_Classification&diff=46473Extreme Multi-label Text Classification2020-11-25T21:13:28Z<p>Msikarou: /* APLC-XLNet */</p>
<hr />
<div>== Presented By ==<br />
Mohan Wu<br />
<br />
== Introduction ==<br />
In this paper, the authors are interested a field of problems called extreme classification. These problems involve training a classifier to give the most relevant tags for any given text; the difficulties arise from the fact that the label set is so large so that most models would fail. The authors propose a new model called APLC-XLNet which fine-tunes the generalized autoregressive pre-trained model (XLNet) by using Adaptive Probabilistic Label Clusters (APLC) to calculate cross-entropy loss. This method takes advantage of unbalanced label distributions by forming clusters to reduce training time. The authors experimented on five different datasets and showed that their method far outweighs the existing state-of-the-art models.<br />
<br />
== Motivation ==<br />
Extreme multi-label text classification (XMTC) has applications in many recent problems such as providing word representations of a large vocabulary [1], tagging Wikipedia with relevant labels [2], and giving product descriptions for search advertisements [3]. The authors are motivated by the shortcomings of traditional methods in the creation of XMTC. For example, one such method of classifying text is the bag-of-words (BOW) approach where a vector represents the frequency of a word in a corpus. However, BOW does not consider the location of the words so it cannot determine context and semantics. Motivated by the success of transfer learning in a wide range of natural language processing (NLP) problems, the authors propose to adapt XLNet [4] to the XMTC problem. The final challenge is the nature of the labeling distribution can be very sparse for some labels. The authors solve this problem by combining the Probabilistic Label Tree [5] method and the Adaptive Softmax [6] to propose APLC.<br />
<br />
== Related Work ==<br />
Two approaches have been proposed to solve the XMTC problem: traditional BOW techniques, and modern deep learning models.<br />
<br />
=== BOW Approaches ===<br />
Intuitively, researchers can apply the one-vs-all approach in which they can fit a classifier for each label and thus XMTC reduces to a binary classification problem. This technique has been shown to achieve high accuracy; however, due to a large number of labels, this approach is very computationally expensive. There have been some techniques to reduce the complexity by pruning weights (PDSparse) to induce sparsity but still, this method is quite expensive. PPDSparse was introduced to parallelize PDSparse in a large-scaled distributed setting. Additionally, DiSMEC was introduced as another extension to PDSparse that allows for a distributed and parallel training mechanism which can be done efficiently by a novel negative sampling technique in addition to the pruning of spurious weights. Another approach is to simply apply a dimensional reduction technique on the label space. However, doing so has shown to have serious negative effects on the prediction accuracy due to loss of information in the compression phase. Finally, another approach is to use a tree to partition labels into groups based on similarity. This approach has shown to be quite fast but unfortunately, due to the problems with BOW methods, the accuracy is poor.<br />
<br />
=== Deep Learning Approaches ===<br />
Unlike BOW approaches, deep learning can learn dense representations of the corpus using context and semantics. One such example is X-BERT[7] which divides XTMC problems into 3 steps. First, it partitions the label set into clusters based on similarity. Next, it fits a BERT model on the label clusters for the given corpus. Finally, it trains simple linear classifiers to rank the labels in each cluster. Since there are elements of traditional techniques in X-BERT, namely, the clustering step and the linear classifier step, the authors propose to improve upon this approach.<br />
<br />
== APLC-XLNet ==<br />
APLC-XLNet consists of three parts: the pre-trained XLNet-Base as the base, the APLC output layer, and a fully connected hidden layer connecting the pooling layer of XLNet as the output layer, as it can be seen in figure 1. To recap, XLNet [8] is a generalized autoregressive pretraining language model based upon capturing bidirectional contexts through maximizing the expected likelihood over all permutations of the factorization<br />
the order which can be expressed as follows:<br />
<br />
\begin{equation}<br />
\underset{\theta}{\max}{ E_{z \sim Z_T}[\sum_{t=1}^{T} \log {p_{\theta}(x_{z_{t}}|x_{z<t})}] }<br />
\end{equation}<br />
<br />
One major challenge in XTMC problems is that most data fall into a small group of labels. To tackle this challenge, the authors propose partitioning the label set into one head cluster, <math> V_h </math>, and many tail clusters, <math> V_1 \cup \ldots \cup V_K </math>. The head cluster contains the most popular labels while the tail clusters contain the rest of the labels. The clusters are then inserted in a 2-level tree where the root node is the head cluster and the leaves are the tail clusters. Using this architecture improves computation times significantly since most of the time the data stops at the root node.<br />
<br />
[[File:Capture1111.JPG |center|600px]]<br />
<br />
<div align="center">Figure 1: Architecture of the proposed APLC-XLNet model. V denotes the label cluster in APLC.</div><br />
<br />
The authors define the probability of each label as follows:<br />
<br />
\begin{equation}<br />
p(y_{ij} | x) = <br />
\begin{cases} <br />
p(y_{ij}|x) & \text{if } y_{ij} \in V_h \\<br />
p(V_t|x)p(y_{ij}|V_t,x) & \text{if } y_{ij} \in V_t<br />
\end{cases}<br />
\end{equation}<br />
<br />
where <math> x </math> is the feature of a given sample, <math> y_{ij} </math> is the j-th label in the i-th sample, and <math> V_t </math> is the t-th tail cluster. Let <math> Y_i </math> be the set of labels for the i-th sample, and define <math> L_i = |Y_i| </math>. The authors propose an intuitive objective loss function for multi-label classification:<br />
<br />
\begin{equation}<br />
J(\theta) = -\frac{1}{\sum_{i=1}^N L_i} \sum_{i=1}^N \sum_{j \in Y_i} (y_{ij} logp(y_{ij}) + (1 - y_{ij}) log(1- p(y_{ij}))<br />
\end{equation}<br />
where <math>N</math> is the number of samples, <math>p(y_{ij})</math> is defined above and <math> y_{ij} \in \{0, 1\} </math>.<br />
<br />
The number of parameters in this model is given by:<br />
\begin{equation}<br />
N_{par} = d(l_h + K) + \sum_{i=1}^K \frac{d}{q^i}(d+l_i)<br />
\end{equation}<br />
where <math> d </math> is the dimension of the hidden state of <math> V_h </math>, <math> q </math> is a decay variable, <math> l_h = |V_h| </math> and <math> l_i = |V_i| </math>. Furthermore, the computational cost can be expressed as follows:<br />
\begin{align}<br />
C &= C_h + \sum_{i=1}^K C_i \\<br />
&= O(N_b d(l_h + K)) + O(\sum_{i=1}^K p_i N_b \frac{d}{q^i}(l_i + d))<br />
\end{align}<br />
where <math> N_b </math> is the batch size.<br />
<br />
=== Training APLC-XLNet ===<br />
Training APLC-XLNet essentially boils down to training its three parts. The authors suggest using discriminative fine-tuning method[9] to train the model entirely while assigning different learning rates to each part. Since XLNet is pretrained, the learning rate, <math> \eta_x </math>, should be small, while the output layer is specific to this type of problem so its learning rate, <math> \eta_a </math>, should be large. For the connecting hidden layer, the authors chose a learning rate, <math> \eta_h </math>, such that <math> \eta_x < \eta_h < \eta_a </math>. For each of the learning rates, the authors suggest a slanted triangular learning schedule[9] defined as:<br />
\begin{equation}<br />
\eta =<br />
\begin{cases} <br />
\eta_0 \frac{t}{t_w} & \text{if } t \leq t_w \\<br />
\eta_0 \frac{t_a - t}{t_a - t_w} & \text{if } t > t_w<br />
\end{cases}<br />
\end{equation}<br />
where <math> \eta_0 </math> is the starting learning rate, <math> t </math> is the current step, <math> t_w </math> is the chosen warm-up threshold and <math> t_a </math> is the total number of steps. The objective here is to motivate the model to converge quickly to the suitable space at the beginning and then refine the parameters. Learning rates are first increased linearly, and then decayed gradually according to the strategy.<br />
<br />
== Results ==<br />
The authors tested the APLC-XLNet model in several benchmark datasets against current state-of-the-art models. The evaluation metric, P@k is defined as:<br />
\begin{equation}<br />
P@k = \frac{1}{k} \sum_{i \in rank_k(\hat{y})} y_i<br />
\end{equation}<br />
where <math> rank_k(\hat{y}) </math> is the top k ranked probability in the prediction vector, <math> \hat{y} </math>.<br />
<br />
[[File:Paper.PNG|1000px|]]<br />
<br />
To help with tuning the model in the number of clusters and different partitions, the authors experimented on two different datasets: EURLex and Wiki10.<br />
<br />
[[File:XTMC2.PNG|1000px|]]<br />
<br />
The three different partitions in the second graph the authors used were (0.7, 0.2, 0.1), (0.33, 0.33, 0.34), and (0.1, 0.2, 0.7) where 3 clusters were fixed.<br />
<br />
== Conclusion ==<br />
The authors have proposed a new deep learning approach to solve the XMTC problem based on XLNet, namely, APLC-XLNet. APLC-XLNet consists of three parts: the pretrained XLNet that takes the input of the text, a connecting hidden layer, and finally an APLC output layer to give the rankings of relevant labels. Their experiments show that APLC-XLNet has better results in several benchmark datasets over the current state-of-the-art models.<br />
<br />
== Critques ==<br />
The authors chose to use the same architecture for every dataset. The model does not achieve state-of-the-art performance on the larger datasets. Perhaps, a more complex model in the second part of the model could help achieve better results. The authors also put a lot of effort in explaining the model complexity for APLC-XLNet but does not compare it other state-of-the-art models. A table of model parameters and complexity for each model could be helpful in explaining why their techniques are efficient.<br />
<br />
== References ==<br />
[1] Mikolov, T., Kombrink, S., Burget, L., Cernock ˇ y, J., and `<br />
Khudanpur, S. Extensions of recurrent neural network<br />
language model. In 2011 IEEE International Conference<br />
on Acoustics, Speech and Signal Processing (ICASSP),<br />
pp. 5528–5531. IEEE, 2011. <br />
<br />
[2] Dekel, O. and Shamir, O. Multiclass-multilabel classification with more classes than examples. In Proceedings<br />
of the Thirteenth International Conference on Artificial<br />
Intelligence and Statistics, pp. 137–144, 2010. <br />
<br />
[3] Jain, H., Prabhu, Y., and Varma, M. Extreme multi-label loss<br />
functions for recommendation, tagging, ranking & other<br />
missing label applications. In Proceedings of the 22nd<br />
ACM SIGKDD International Conference on Knowledge<br />
Discovery and Data Mining, pp. 935–944. ACM, 2016. <br />
<br />
[4] Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li,<br />
M., and Lin, J. End-to-end open-domain question answering with BERTserini. In NAACL-HLT (Demonstrations),<br />
2019a. <br />
<br />
[5] Jasinska, K., Dembczynski, K., Busa-Fekete, R.,<br />
Pfannschmidt, K., Klerx, T., and Hullermeier, E. Extreme f-measure maximization using sparse probability<br />
estimates. In International Conference on Machine Learning, pp. 1435–1444, 2016. <br />
<br />
[6] Grave, E., Joulin, A., Cisse, M., J ´ egou, H., et al. Effi- ´<br />
cient softmax approximation for gpus. In Proceedings of<br />
the 34th International Conference on Machine LearningVolume 70, pp. 1302–1310. JMLR.org, 2017 <br />
<br />
[7] Wei-Cheng, C., Hsiang-Fu, Y., Kai, Z., Yiming, Y., and<br />
Inderjit, D. X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder Representations from<br />
Transformers. In NeurIPS Science Meets Engineering of<br />
Deep Learning Workshop, 2019.<br />
<br />
[8] Howard, J. and Ruder, S. Universal language model finetuning for text classification. In Proceedings of the 56th<br />
Annual Meeting of the Association for Computational<br />
Linguistics (Volume 1: Long Papers), pp. 328–339, 2018</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_Classification&diff=46472Extreme Multi-label Text Classification2020-11-25T21:09:57Z<p>Msikarou: /* APLC-XLNet */</p>
<hr />
<div>== Presented By ==<br />
Mohan Wu<br />
<br />
== Introduction ==<br />
In this paper, the authors are interested a field of problems called extreme classification. These problems involve training a classifier to give the most relevant tags for any given text; the difficulties arise from the fact that the label set is so large so that most models would fail. The authors propose a new model called APLC-XLNet which fine-tunes the generalized autoregressive pre-trained model (XLNet) by using Adaptive Probabilistic Label Clusters (APLC) to calculate cross-entropy loss. This method takes advantage of unbalanced label distributions by forming clusters to reduce training time. The authors experimented on five different datasets and showed that their method far outweighs the existing state-of-the-art models.<br />
<br />
== Motivation ==<br />
Extreme multi-label text classification (XMTC) has applications in many recent problems such as providing word representations of a large vocabulary [1], tagging Wikipedia with relevant labels [2], and giving product descriptions for search advertisements [3]. The authors are motivated by the shortcomings of traditional methods in the creation of XMTC. For example, one such method of classifying text is the bag-of-words (BOW) approach where a vector represents the frequency of a word in a corpus. However, BOW does not consider the location of the words so it cannot determine context and semantics. Motivated by the success of transfer learning in a wide range of natural language processing (NLP) problems, the authors propose to adapt XLNet [4] to the XMTC problem. The final challenge is the nature of the labeling distribution can be very sparse for some labels. The authors solve this problem by combining the Probabilistic Label Tree [5] method and the Adaptive Softmax [6] to propose APLC.<br />
<br />
== Related Work ==<br />
Two approaches have been proposed to solve the XMTC problem: traditional BOW techniques, and modern deep learning models.<br />
<br />
=== BOW Approaches ===<br />
Intuitively, researchers can apply the one-vs-all approach in which they can fit a classifier for each label and thus XMTC reduces to a binary classification problem. This technique has been shown to achieve high accuracy; however, due to a large number of labels, this approach is very computationally expensive. There have been some techniques to reduce the complexity by pruning weights (PDSparse) to induce sparsity but still, this method is quite expensive. PPDSparse was introduced to parallelize PDSparse in a large-scaled distributed setting. Additionally, DiSMEC was introduced as another extension to PDSparse that allows for a distributed and parallel training mechanism which can be done efficiently by a novel negative sampling technique in addition to the pruning of spurious weights. Another approach is to simply apply a dimensional reduction technique on the label space. However, doing so has shown to have serious negative effects on the prediction accuracy due to loss of information in the compression phase. Finally, another approach is to use a tree to partition labels into groups based on similarity. This approach has shown to be quite fast but unfortunately, due to the problems with BOW methods, the accuracy is poor.<br />
<br />
=== Deep Learning Approaches ===<br />
Unlike BOW approaches, deep learning can learn dense representations of the corpus using context and semantics. One such example is X-BERT[7] which divides XTMC problems into 3 steps. First, it partitions the label set into clusters based on similarity. Next, it fits a BERT model on the label clusters for the given corpus. Finally, it trains simple linear classifiers to rank the labels in each cluster. Since there are elements of traditional techniques in X-BERT, namely, the clustering step and the linear classifier step, the authors propose to improve upon this approach.<br />
<br />
== APLC-XLNet ==<br />
APLC-XLNet consists of three parts: the pre-trained XLNet-Base as the base, the APLC output layer, and a fully connected hidden layer connecting the pooling layer of XLNet as the output layer, as it can be seen in figure 1. To recap, XLNet (Yang et al., 2019b) is a generalized autoregressive pretraining language model based upon capturing bidirectional contexts through maximizing the expected likelihood over all permutations of the factorization<br />
the order which can be expressed as follows:<br />
<br />
\begin{equation}<br />
\underset{\theta}{\max}{ E_{z \sim Z_T}[\sum_{t=1}^{T} \log {p_{\theta}(x_{z_{t}}|x_{z<t})}] }<br />
\end{equation}<br />
<br />
One major challenge in XTMC problems is that most data fall into a small group of labels. To tackle this challenge, the authors propose partitioning the label set into one head cluster, <math> V_h </math>, and many tail clusters, <math> V_1 \cup \ldots \cup V_K </math>. The head cluster contains the most popular labels while the tail clusters contain the rest of the labels. The clusters are then inserted in a 2-level tree where the root node is the head cluster and the leaves are the tail clusters. Using this architecture improves computation times significantly since most of the time the data stops at the root node.<br />
<br />
[[File:Capture1111.JPG |center|600px]]<br />
<br />
<div align="center">Figure 1: Architecture of the proposed APLC-XLNet model. V denotes the label cluster in APLC.</div><br />
<br />
The authors define the probability of each label as follows:<br />
<br />
\begin{equation}<br />
p(y_{ij} | x) = <br />
\begin{cases} <br />
p(y_{ij}|x) & \text{if } y_{ij} \in V_h \\<br />
p(V_t|x)p(y_{ij}|V_t,x) & \text{if } y_{ij} \in V_t<br />
\end{cases}<br />
\end{equation}<br />
<br />
where <math> x </math> is the feature of a given sample, <math> y_{ij} </math> is the j-th label in the i-th sample, and <math> V_t </math> is the t-th tail cluster. Let <math> Y_i </math> be the set of labels for the i-th sample, and define <math> L_i = |Y_i| </math>. The authors propose an intuitive objective loss function for multi-label classification:<br />
<br />
\begin{equation}<br />
J(\theta) = -\frac{1}{\sum_{i=1}^N L_i} \sum_{i=1}^N \sum_{j \in Y_i} (y_{ij} logp(y_{ij}) + (1 - y_{ij}) log(1- p(y_{ij}))<br />
\end{equation}<br />
where <math>N</math> is the number of samples, <math>p(y_{ij})</math> is defined above and <math> y_{ij} \in \{0, 1\} </math>.<br />
<br />
The number of parameters in this model is given by:<br />
\begin{equation}<br />
N_{par} = d(l_h + K) + \sum_{i=1}^K \frac{d}{q^i}(d+l_i)<br />
\end{equation}<br />
where <math> d </math> is the dimension of the hidden state of <math> V_h </math>, <math> q </math> is a decay variable, <math> l_h = |V_h| </math> and <math> l_i = |V_i| </math>. Furthermore, the computational cost can be expressed as follows:<br />
\begin{align}<br />
C &= C_h + \sum_{i=1}^K C_i \\<br />
&= O(N_b d(l_h + K)) + O(\sum_{i=1}^K p_i N_b \frac{d}{q^i}(l_i + d))<br />
\end{align}<br />
where <math> N_b </math> is the batch size.<br />
<br />
=== Training APLC-XLNet ===<br />
Training APLC-XLNet essentially boils down to training its three parts. The authors suggest using discriminative fine-tuning method[8] to train the model entirely while assigning different learning rates to each part. Since XLNet is pretrained, the learning rate, <math> \eta_x </math>, should be small, while the output layer is specific to this type of problem so its learning rate, <math> \eta_a </math>, should be large. For the connecting hidden layer, the authors chose a learning rate, <math> \eta_h </math>, such that <math> \eta_x < \eta_h < \eta_a </math>. For each of the learning rates, the authors suggest a slanted triangular learning schedule[8] defined as:<br />
\begin{equation}<br />
\eta =<br />
\begin{cases} <br />
\eta_0 \frac{t}{t_w} & \text{if } t \leq t_w \\<br />
\eta_0 \frac{t_a - t}{t_a - t_w} & \text{if } t > t_w<br />
\end{cases}<br />
\end{equation}<br />
where <math> \eta_0 </math> is the starting learning rate, <math> t </math> is the current step, <math> t_w </math> is the chosen warm-up threshold and <math> t_a </math> is the total number of steps. The objective here is to motivate the model to converge quickly to the suitable space at the beginning and then refine the parameters. Learning rates are first increased linearly, and then decayed gradually according to the strategy.<br />
<br />
== Results ==<br />
The authors tested the APLC-XLNet model in several benchmark datasets against current state-of-the-art models. The evaluation metric, P@k is defined as:<br />
\begin{equation}<br />
P@k = \frac{1}{k} \sum_{i \in rank_k(\hat{y})} y_i<br />
\end{equation}<br />
where <math> rank_k(\hat{y}) </math> is the top k ranked probability in the prediction vector, <math> \hat{y} </math>.<br />
<br />
[[File:Paper.PNG|1000px|]]<br />
<br />
To help with tuning the model in the number of clusters and different partitions, the authors experimented on two different datasets: EURLex and Wiki10.<br />
<br />
[[File:XTMC2.PNG|1000px|]]<br />
<br />
The three different partitions in the second graph the authors used were (0.7, 0.2, 0.1), (0.33, 0.33, 0.34), and (0.1, 0.2, 0.7) where 3 clusters were fixed.<br />
<br />
== Conclusion ==<br />
The authors have proposed a new deep learning approach to solve the XMTC problem based on XLNet, namely, APLC-XLNet. APLC-XLNet consists of three parts: the pretrained XLNet that takes the input of the text, a connecting hidden layer, and finally an APLC output layer to give the rankings of relevant labels. Their experiments show that APLC-XLNet has better results in several benchmark datasets over the current state-of-the-art models.<br />
<br />
== Critques ==<br />
The authors chose to use the same architecture for every dataset. The model does not achieve state-of-the-art performance on the larger datasets. Perhaps, a more complex model in the second part of the model could help achieve better results. The authors also put a lot of effort in explaining the model complexity for APLC-XLNet but does not compare it other state-of-the-art models. A table of model parameters and complexity for each model could be helpful in explaining why their techniques are efficient.<br />
<br />
== References ==<br />
[1] Mikolov, T., Kombrink, S., Burget, L., Cernock ˇ y, J., and `<br />
Khudanpur, S. Extensions of recurrent neural network<br />
language model. In 2011 IEEE International Conference<br />
on Acoustics, Speech and Signal Processing (ICASSP),<br />
pp. 5528–5531. IEEE, 2011. <br />
<br />
[2] Dekel, O. and Shamir, O. Multiclass-multilabel classification with more classes than examples. In Proceedings<br />
of the Thirteenth International Conference on Artificial<br />
Intelligence and Statistics, pp. 137–144, 2010. <br />
<br />
[3] Jain, H., Prabhu, Y., and Varma, M. Extreme multi-label loss<br />
functions for recommendation, tagging, ranking & other<br />
missing label applications. In Proceedings of the 22nd<br />
ACM SIGKDD International Conference on Knowledge<br />
Discovery and Data Mining, pp. 935–944. ACM, 2016. <br />
<br />
[4] Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li,<br />
M., and Lin, J. End-to-end open-domain question answering with BERTserini. In NAACL-HLT (Demonstrations),<br />
2019a. <br />
<br />
[5] Jasinska, K., Dembczynski, K., Busa-Fekete, R.,<br />
Pfannschmidt, K., Klerx, T., and Hullermeier, E. Extreme f-measure maximization using sparse probability<br />
estimates. In International Conference on Machine Learning, pp. 1435–1444, 2016. <br />
<br />
[6] Grave, E., Joulin, A., Cisse, M., J ´ egou, H., et al. Effi- ´<br />
cient softmax approximation for gpus. In Proceedings of<br />
the 34th International Conference on Machine LearningVolume 70, pp. 1302–1310. JMLR.org, 2017 <br />
<br />
[7] Wei-Cheng, C., Hsiang-Fu, Y., Kai, Z., Yiming, Y., and<br />
Inderjit, D. X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder Representations from<br />
Transformers. In NeurIPS Science Meets Engineering of<br />
Deep Learning Workshop, 2019.<br />
<br />
[8] Howard, J. and Ruder, S. Universal language model finetuning for text classification. In Proceedings of the 56th<br />
Annual Meeting of the Association for Computational<br />
Linguistics (Volume 1: Long Papers), pp. 328–339, 2018</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_Classification&diff=46447Extreme Multi-label Text Classification2020-11-25T16:33:08Z<p>Msikarou: /* BOW Approaches */</p>
<hr />
<div>== Presented By ==<br />
Mohan Wu<br />
<br />
== Introduction ==<br />
In this paper, the authors are interested a field of problems called extreme classification. These problems involve training a classifier to give the most relevant tags for any given text; the difficulties arise from the fact that the label set is so large so that most models would fail. The authors propose a new model called APLC-XLNet which fine-tunes the generalized autoregressive pre-trained model (XLNet) by using Adaptive Probabilistic Label Clusters (APLC) to calculate cross-entropy loss. This method takes advantage of unbalanced label distributions by forming clusters to reduce training time. The authors experimented on five different datasets and showed that their method far outweighs the existing state-of-the-art models.<br />
<br />
== Motivation ==<br />
Extreme multi-label text classification (XMTC) has applications in many recent problems such as providing word representations of a large vocabulary [1], tagging Wikipedia with relevant labels [2], and giving product descriptions for search advertisements [3]. The authors are motivated by the shortcomings of traditional methods in the creation of XMTC. For example, one such method of classifying text is the bag-of-words (BOW) approach where a vector represents the frequency of a word in a corpus. However, BOW does not consider the location of the words so it cannot determine context and semantics. Motivated by the success of transfer learning in a wide range of natural language processing (NLP) problems, the authors propose to adapt XLNet [4] to the XMTC problem. The final challenge is the nature of the labeling distribution can be very sparse for some labels. The authors solve this problem by combining the Probabilistic Label Tree [5] method and the Adaptive Softmax [6] to propose APLC.<br />
<br />
== Related Work ==<br />
Two approaches have been proposed to solve the XMTC problem: traditional BOW techniques, and modern deep learning models.<br />
<br />
=== BOW Approaches ===<br />
Intuitively, researchers can apply the one-vs-all approach in which they can fit a classifier for each label and thus XMTC reduces to a binary classification problem. This technique has been shown to achieve high accuracy; however, due to a large number of labels, this approach is very computationally expensive. There have been some techniques to reduce the complexity by pruning weights (PDSparse) to induce sparsity but still, this method is quite expensive. PPDSparse was introduced to parallelize PDSparse in a large-scaled distributed setting. Additionally, DiSMEC was introduced as another extension to PDSparse that allows for a distributed and parallel training mechanism which can be done efficiently by a novel negative sampling technique in addition to the pruning of spurious weights. Another approach is to simply apply a dimensional reduction technique on the label space. However, doing so has shown to have serious negative effects on the prediction accuracy due to loss of information in the compression phase. Finally, another approach is to use a tree to partition labels into groups based on similarity. This approach has shown to be quite fast but unfortunately, due to the problems with BOW methods, the accuracy is poor.<br />
<br />
=== Deep Learning Approaches ===<br />
Unlike BOW approaches, deep learning can learn dense representations of the corpus using context and semantics. One such example is X-BERT[7] which divides XTMC problems into 3 steps. First, it partitions the label set into clusters based on similarity. Next, it fits a BERT model on the label clusters for the given corpus. Finally, it trains simple linear classifiers to rank the labels in each cluster. Since there are elements of traditional techniques in X-BERT, namely, the clustering step and the linear classifier step, the authors propose to improve upon this approach.<br />
<br />
== APLC-XLNet ==<br />
APLC-XLNet consists of three parts: the pretrained XLNet-Base as the base, the APLC output layer and a fully connected hidden layer connecting the pool layer of XLNet the output layer, as it can be seen in figure 1. One major challenge in XTMC problems is that most data fall into a small group of labels. To tackle this challenge, the authors propose partitioning the label set into one head cluster, <math> V_h </math>, and many tail clusters, <math> V_1 \cup \ldots \cup V_K </math>. The head cluster contains the most popular labels while the tail clusters contains the rest of the labels. The clusters are then inserted in a 2-level tree where the root node is the head cluster and the leaves are the tail clusters. Using this architecture improves computation times significantly since most of the time the data stops at the root node.<br />
<br />
[[File:Capture1111.JPG |center|600px]]<br />
<br />
<div align="center">Figure 1: Architecture of the proposed APLC-XLNet model. V denotes the label cluster in APLC.</div><br />
<br />
The authors define the probability of each label as follows:<br />
<br />
\begin{equation}<br />
p(y_{ij} | x) = <br />
\begin{cases} <br />
p(y_{ij}|x) & \text{if } y_{ij} \in V_h \\<br />
p(V_t|x)p(y_{ij}|V_t,x) & \text{if } y_{ij} \in V_t<br />
\end{cases}<br />
\end{equation}<br />
<br />
where <math> x </math> is the feature of a given sample, <math> y_{ij} </math> is the j-th label in the i-th sample, and <math> V_t </math> is the t-th tail cluster. Let <math> Y_i </math> be the set of labels for the i-th sample, and define <math> L_i = |Y_i| </math>. The authors propose an intuitive objective loss function for multi-label classification:<br />
<br />
\begin{equation}<br />
J(\theta) = -\frac{1}{\sum_{i=1}^N L_i} \sum_{i=1}^N \sum_{j \in Y_i} (y_{ij} logp(y_{ij}) + (1 - y_{ij}) log(1- p(y_{ij}))<br />
\end{equation}<br />
where <math>N</math> is the number of samples, <math>p(y_{ij})</math> is defined above and <math> y_{ij} \in \{0, 1\} </math>.<br />
<br />
The number of parameters in this model is given by:<br />
\begin{equation}<br />
N_{par} = d(l_h + K) + \sum_{i=1}^K \frac{d}{q^i}(d+l_i)<br />
\end{equation}<br />
where <math> d </math> is the dimension of the hidden state of <math> V_h </math>, <math> q </math> is a decay variable, <math> l_h = |V_h| </math> and <math> l_i = |V_i| </math>. Furthermore, the computational cost can be expressed as follows:<br />
\begin{align}<br />
C &= C_h + \sum_{i=1}^K C_i \\<br />
&= O(N_b d(l_h + K)) + O(\sum_{i=1}^K p_i N_b \frac{d}{q^i}(l_i + d))<br />
\end{align}<br />
where <math> N_b </math> is the batch size.<br />
<br />
=== Training APLC-XLNet ===<br />
Training APLC-XLNet essentially boils down to training its three parts. The authors suggest using discriminative fine-tuning method[8] to train the model entirely while assigning different learning rates to each part. Since XLNet is pretrained, the learning rate, <math> \eta_x </math>, should be small, while the output layer is specific to this type of problem so its learning rate, <math> \eta_a </math>, should be large. For the connecting hidden layer, the authors chose a learning rate, <math> \eta_h </math>, such that <math> \eta_x < \eta_h < \eta_a </math>. For each of the learning rates, the authors suggest a slanted triangular learning schedule[8] defined as:<br />
\begin{equation}<br />
\eta =<br />
\begin{cases} <br />
\eta_0 \frac{t}{t_w} & \text{if } t \leq t_w \\<br />
\eta_0 \frac{t_a - t}{t_a - t_w} & \text{if } t > t_w<br />
\end{cases}<br />
\end{equation}<br />
where <math> \eta_0 </math> is the starting learning rate, <math> t </math> is the current step, <math> t_w </math> is the chosen warm-up threshold and <math> t_a </math> is the total number of steps. The objective here is to motivate the model to converge quickly to the suitable space at the beginning and then refine the parameters. Learning rates are first increased linearly, and then decayed gradually according to the strategy.<br />
<br />
== Results ==<br />
The authors tested the APLC-XLNet model in several benchmark datasets against current state-of-the-art models. The evaluation metric, P@k is defined as:<br />
\begin{equation}<br />
P@k = \frac{1}{k} \sum_{i \in rank_k(\hat{y})} y_i<br />
\end{equation}<br />
where <math> rank_k(\hat{y}) </math> is the top k ranked probability in the prediction vector, <math> \hat{y} </math>.<br />
<br />
[[File:Paper.PNG|1000px|]]<br />
<br />
To help with tuning the model in the number of clusters and different partitions, the authors experimented on two different datasets: EURLex and Wiki10.<br />
<br />
[[File:XTMC2.PNG|1000px|]]<br />
<br />
The three different partitions in the second graph the authors used were (0.7, 0.2, 0.1), (0.33, 0.33, 0.34), and (0.1, 0.2, 0.7) where 3 clusters were fixed.<br />
<br />
== Conclusion ==<br />
The authors have proposed a new deep learning approach to solve the XMTC problem based on XLNet, namely, APLC-XLNet. APLC-XLNet consists of three parts: the pretrained XLNet that takes the input of the text, a connecting hidden layer, and finally an APLC output layer to give the rankings of relevant labels. Their experiments show that APLC-XLNet has better results in several benchmark datasets over the current state-of-the-art models.<br />
<br />
== Critques ==<br />
The authors chose to use the same architecture for every dataset. The model does not achieve state-of-the-art performance on the larger datasets. Perhaps, a more complex model in the second part of the model could help achieve better results. The authors also put a lot of effort in explaining the model complexity for APLC-XLNet but does not compare it other state-of-the-art models. A table of model parameters and complexity for each model could be helpful in explaining why their techniques are efficient.<br />
<br />
== References ==<br />
[1] Mikolov, T., Kombrink, S., Burget, L., Cernock ˇ y, J., and `<br />
Khudanpur, S. Extensions of recurrent neural network<br />
language model. In 2011 IEEE International Conference<br />
on Acoustics, Speech and Signal Processing (ICASSP),<br />
pp. 5528–5531. IEEE, 2011. <br />
<br />
[2] Dekel, O. and Shamir, O. Multiclass-multilabel classification with more classes than examples. In Proceedings<br />
of the Thirteenth International Conference on Artificial<br />
Intelligence and Statistics, pp. 137–144, 2010. <br />
<br />
[3] Jain, H., Prabhu, Y., and Varma, M. Extreme multi-label loss<br />
functions for recommendation, tagging, ranking & other<br />
missing label applications. In Proceedings of the 22nd<br />
ACM SIGKDD International Conference on Knowledge<br />
Discovery and Data Mining, pp. 935–944. ACM, 2016. <br />
<br />
[4] Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li,<br />
M., and Lin, J. End-to-end open-domain question answering with BERTserini. In NAACL-HLT (Demonstrations),<br />
2019a. <br />
<br />
[5] Jasinska, K., Dembczynski, K., Busa-Fekete, R.,<br />
Pfannschmidt, K., Klerx, T., and Hullermeier, E. Extreme f-measure maximization using sparse probability<br />
estimates. In International Conference on Machine Learning, pp. 1435–1444, 2016. <br />
<br />
[6] Grave, E., Joulin, A., Cisse, M., J ´ egou, H., et al. Effi- ´<br />
cient softmax approximation for gpus. In Proceedings of<br />
the 34th International Conference on Machine LearningVolume 70, pp. 1302–1310. JMLR.org, 2017 <br />
<br />
[7] Wei-Cheng, C., Hsiang-Fu, Y., Kai, Z., Yiming, Y., and<br />
Inderjit, D. X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder Representations from<br />
Transformers. In NeurIPS Science Meets Engineering of<br />
Deep Learning Workshop, 2019.<br />
<br />
[8] Howard, J. and Ruder, S. Universal language model finetuning for text classification. In Proceedings of the 56th<br />
Annual Meeting of the Association for Computational<br />
Linguistics (Volume 1: Long Papers), pp. 328–339, 2018</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_Classification&diff=46446Extreme Multi-label Text Classification2020-11-25T16:30:59Z<p>Msikarou: /* Motivation */</p>
<hr />
<div>== Presented By ==<br />
Mohan Wu<br />
<br />
== Introduction ==<br />
In this paper, the authors are interested a field of problems called extreme classification. These problems involve training a classifier to give the most relevant tags for any given text; the difficulties arise from the fact that the label set is so large so that most models would fail. The authors propose a new model called APLC-XLNet which fine-tunes the generalized autoregressive pre-trained model (XLNet) by using Adaptive Probabilistic Label Clusters (APLC) to calculate cross-entropy loss. This method takes advantage of unbalanced label distributions by forming clusters to reduce training time. The authors experimented on five different datasets and showed that their method far outweighs the existing state-of-the-art models.<br />
<br />
== Motivation ==<br />
Extreme multi-label text classification (XMTC) has applications in many recent problems such as providing word representations of a large vocabulary [1], tagging Wikipedia with relevant labels [2], and giving product descriptions for search advertisements [3]. The authors are motivated by the shortcomings of traditional methods in the creation of XMTC. For example, one such method of classifying text is the bag-of-words (BOW) approach where a vector represents the frequency of a word in a corpus. However, BOW does not consider the location of the words so it cannot determine context and semantics. Motivated by the success of transfer learning in a wide range of natural language processing (NLP) problems, the authors propose to adapt XLNet [4] to the XMTC problem. The final challenge is the nature of the labeling distribution can be very sparse for some labels. The authors solve this problem by combining the Probabilistic Label Tree [5] method and the Adaptive Softmax [6] to propose APLC.<br />
<br />
== Related Work ==<br />
Two approaches have been proposed to solve the XMTC problem: traditional BOW techniques, and modern deep learning models.<br />
<br />
=== BOW Approaches ===<br />
Intuitively, researchers can apply the one-vs-all approach in which they can fit a classifier for each label and thus XMTC reduces to a binary classification problem. This technique has been shown to achieve high accuracy; however, due to the large number of labels, this approach is very computationally expensive. There have been some techniques to reduce the complexity by pruning weights (PDSparse) to induce sparsity but still this method is quite expensive. PPDSparse was introduced to parallelize PDSparse in a large scaled distributed setting. Additionally, DiSMEC was introduced as another extention to PDSparse that allows for distributed and parallel training mechanism which can be done efficiently by a novel negative sampling technique in addition to the pruning of spurious weights. Another approach is to simply apply a dimensional reduction technique on the label space. However doing so have shown to have serious negative effects on the prediction accuracy due to loss of information in the compression phase. Finally, another approach is to use a tree to partition labels into groups based on similarity. This approach have shown to be quite fast but unfortunately, due to the problems with BOW methods, the accuracy is poor.<br />
<br />
=== Deep Learning Approaches ===<br />
Unlike BOW approaches, deep learning can learn dense representations of the corpus using context and semantics. One such example is X-BERT[7] which divides XTMC problems into 3 steps. First, it partitions the label set into clusters based on similarity. Next, it fits a BERT model on the label clusters for the given corpus. Finally, it trains simple linear classifiers to rank the labels in each cluster. Since there are elements of traditional techniques in X-BERT, namely, the clustering step and the linear classifier step, the authors propose to improve upon this approach.<br />
<br />
== APLC-XLNet ==<br />
APLC-XLNet consists of three parts: the pretrained XLNet-Base as the base, the APLC output layer and a fully connected hidden layer connecting the pool layer of XLNet the output layer, as it can be seen in figure 1. One major challenge in XTMC problems is that most data fall into a small group of labels. To tackle this challenge, the authors propose partitioning the label set into one head cluster, <math> V_h </math>, and many tail clusters, <math> V_1 \cup \ldots \cup V_K </math>. The head cluster contains the most popular labels while the tail clusters contains the rest of the labels. The clusters are then inserted in a 2-level tree where the root node is the head cluster and the leaves are the tail clusters. Using this architecture improves computation times significantly since most of the time the data stops at the root node.<br />
<br />
[[File:Capture1111.JPG |center|600px]]<br />
<br />
<div align="center">Figure 1: Architecture of the proposed APLC-XLNet model. V denotes the label cluster in APLC.</div><br />
<br />
The authors define the probability of each label as follows:<br />
<br />
\begin{equation}<br />
p(y_{ij} | x) = <br />
\begin{cases} <br />
p(y_{ij}|x) & \text{if } y_{ij} \in V_h \\<br />
p(V_t|x)p(y_{ij}|V_t,x) & \text{if } y_{ij} \in V_t<br />
\end{cases}<br />
\end{equation}<br />
<br />
where <math> x </math> is the feature of a given sample, <math> y_{ij} </math> is the j-th label in the i-th sample, and <math> V_t </math> is the t-th tail cluster. Let <math> Y_i </math> be the set of labels for the i-th sample, and define <math> L_i = |Y_i| </math>. The authors propose an intuitive objective loss function for multi-label classification:<br />
<br />
\begin{equation}<br />
J(\theta) = -\frac{1}{\sum_{i=1}^N L_i} \sum_{i=1}^N \sum_{j \in Y_i} (y_{ij} logp(y_{ij}) + (1 - y_{ij}) log(1- p(y_{ij}))<br />
\end{equation}<br />
where <math>N</math> is the number of samples, <math>p(y_{ij})</math> is defined above and <math> y_{ij} \in \{0, 1\} </math>.<br />
<br />
The number of parameters in this model is given by:<br />
\begin{equation}<br />
N_{par} = d(l_h + K) + \sum_{i=1}^K \frac{d}{q^i}(d+l_i)<br />
\end{equation}<br />
where <math> d </math> is the dimension of the hidden state of <math> V_h </math>, <math> q </math> is a decay variable, <math> l_h = |V_h| </math> and <math> l_i = |V_i| </math>. Furthermore, the computational cost can be expressed as follows:<br />
\begin{align}<br />
C &= C_h + \sum_{i=1}^K C_i \\<br />
&= O(N_b d(l_h + K)) + O(\sum_{i=1}^K p_i N_b \frac{d}{q^i}(l_i + d))<br />
\end{align}<br />
where <math> N_b </math> is the batch size.<br />
<br />
=== Training APLC-XLNet ===<br />
Training APLC-XLNet essentially boils down to training its three parts. The authors suggest using discriminative fine-tuning method[8] to train the model entirely while assigning different learning rates to each part. Since XLNet is pretrained, the learning rate, <math> \eta_x </math>, should be small, while the output layer is specific to this type of problem so its learning rate, <math> \eta_a </math>, should be large. For the connecting hidden layer, the authors chose a learning rate, <math> \eta_h </math>, such that <math> \eta_x < \eta_h < \eta_a </math>. For each of the learning rates, the authors suggest a slanted triangular learning schedule[8] defined as:<br />
\begin{equation}<br />
\eta =<br />
\begin{cases} <br />
\eta_0 \frac{t}{t_w} & \text{if } t \leq t_w \\<br />
\eta_0 \frac{t_a - t}{t_a - t_w} & \text{if } t > t_w<br />
\end{cases}<br />
\end{equation}<br />
where <math> \eta_0 </math> is the starting learning rate, <math> t </math> is the current step, <math> t_w </math> is the chosen warm-up threshold and <math> t_a </math> is the total number of steps. The objective here is to motivate the model to converge quickly to the suitable space at the beginning and then refine the parameters. Learning rates are first increased linearly, and then decayed gradually according to the strategy.<br />
<br />
== Results ==<br />
The authors tested the APLC-XLNet model in several benchmark datasets against current state-of-the-art models. The evaluation metric, P@k is defined as:<br />
\begin{equation}<br />
P@k = \frac{1}{k} \sum_{i \in rank_k(\hat{y})} y_i<br />
\end{equation}<br />
where <math> rank_k(\hat{y}) </math> is the top k ranked probability in the prediction vector, <math> \hat{y} </math>.<br />
<br />
[[File:Paper.PNG|1000px|]]<br />
<br />
To help with tuning the model in the number of clusters and different partitions, the authors experimented on two different datasets: EURLex and Wiki10.<br />
<br />
[[File:XTMC2.PNG|1000px|]]<br />
<br />
The three different partitions in the second graph the authors used were (0.7, 0.2, 0.1), (0.33, 0.33, 0.34), and (0.1, 0.2, 0.7) where 3 clusters were fixed.<br />
<br />
== Conclusion ==<br />
The authors have proposed a new deep learning approach to solve the XMTC problem based on XLNet, namely, APLC-XLNet. APLC-XLNet consists of three parts: the pretrained XLNet that takes the input of the text, a connecting hidden layer, and finally an APLC output layer to give the rankings of relevant labels. Their experiments show that APLC-XLNet has better results in several benchmark datasets over the current state-of-the-art models.<br />
<br />
== Critques ==<br />
The authors chose to use the same architecture for every dataset. The model does not achieve state-of-the-art performance on the larger datasets. Perhaps, a more complex model in the second part of the model could help achieve better results. The authors also put a lot of effort in explaining the model complexity for APLC-XLNet but does not compare it other state-of-the-art models. A table of model parameters and complexity for each model could be helpful in explaining why their techniques are efficient.<br />
<br />
== References ==<br />
[1] Mikolov, T., Kombrink, S., Burget, L., Cernock ˇ y, J., and `<br />
Khudanpur, S. Extensions of recurrent neural network<br />
language model. In 2011 IEEE International Conference<br />
on Acoustics, Speech and Signal Processing (ICASSP),<br />
pp. 5528–5531. IEEE, 2011. <br />
<br />
[2] Dekel, O. and Shamir, O. Multiclass-multilabel classification with more classes than examples. In Proceedings<br />
of the Thirteenth International Conference on Artificial<br />
Intelligence and Statistics, pp. 137–144, 2010. <br />
<br />
[3] Jain, H., Prabhu, Y., and Varma, M. Extreme multi-label loss<br />
functions for recommendation, tagging, ranking & other<br />
missing label applications. In Proceedings of the 22nd<br />
ACM SIGKDD International Conference on Knowledge<br />
Discovery and Data Mining, pp. 935–944. ACM, 2016. <br />
<br />
[4] Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li,<br />
M., and Lin, J. End-to-end open-domain question answering with BERTserini. In NAACL-HLT (Demonstrations),<br />
2019a. <br />
<br />
[5] Jasinska, K., Dembczynski, K., Busa-Fekete, R.,<br />
Pfannschmidt, K., Klerx, T., and Hullermeier, E. Extreme f-measure maximization using sparse probability<br />
estimates. In International Conference on Machine Learning, pp. 1435–1444, 2016. <br />
<br />
[6] Grave, E., Joulin, A., Cisse, M., J ´ egou, H., et al. Effi- ´<br />
cient softmax approximation for gpus. In Proceedings of<br />
the 34th International Conference on Machine LearningVolume 70, pp. 1302–1310. JMLR.org, 2017 <br />
<br />
[7] Wei-Cheng, C., Hsiang-Fu, Y., Kai, Z., Yiming, Y., and<br />
Inderjit, D. X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder Representations from<br />
Transformers. In NeurIPS Science Meets Engineering of<br />
Deep Learning Workshop, 2019.<br />
<br />
[8] Howard, J. and Ruder, S. Universal language model finetuning for text classification. In Proceedings of the 56th<br />
Annual Meeting of the Association for Computational<br />
Linguistics (Volume 1: Long Papers), pp. 328–339, 2018</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_Classification&diff=46445Extreme Multi-label Text Classification2020-11-25T16:21:22Z<p>Msikarou: /* Introduction */</p>
<hr />
<div>== Presented By ==<br />
Mohan Wu<br />
<br />
== Introduction ==<br />
In this paper, the authors are interested a field of problems called extreme classification. These problems involve training a classifier to give the most relevant tags for any given text; the difficulties arise from the fact that the label set is so large so that most models would fail. The authors propose a new model called APLC-XLNet which fine-tunes the generalized autoregressive pre-trained model (XLNet) by using Adaptive Probabilistic Label Clusters (APLC) to calculate cross-entropy loss. This method takes advantage of unbalanced label distributions by forming clusters to reduce training time. The authors experimented on five different datasets and showed that their method far outweighs the existing state-of-the-art models.<br />
<br />
== Motivation ==<br />
Extreme multi-label text classification (XMTC) has applications in many recent problems such as providing word representations of a large vocabulary [1], tagging Wikipedia with relevant labels [2] and giving product descriptions for search advertisements [3]. The authors are motivated by the shortcomings of traditional methods in the creation of XMTC. For example, one such method of classifying text is the bag-of-words (BOW) approach where a vector represents the frequency of a word in a corpus. However, BOW does not consider the location of the words so it cannot determine context and semantics. Motivated by the success of transfer learning in a wide range of natural language processing (NLP) problems, the authors propose to adapt XLNet [4] on the XMTC problem. The final challenge is the nature of the labelling distribution can be very sparse for some labels. The authors solve this problem by combining the Probabilistic Label Tree [5] method and the Adaptive Softmax [6] to create APLC.<br />
<br />
== Related Work ==<br />
Two approaches have been proposed to solve the XMTC problem: traditional BOW techniques, and modern deep learning models.<br />
<br />
=== BOW Approaches ===<br />
Intuitively, researchers can apply the one-vs-all approach in which they can fit a classifier for each label and thus XMTC reduces to a binary classification problem. This technique has been shown to achieve high accuracy; however, due to the large number of labels, this approach is very computationally expensive. There have been some techniques to reduce the complexity by pruning weights (PDSparse) to induce sparsity but still this method is quite expensive. PPDSparse was introduced to parallelize PDSparse in a large scaled distributed setting. Additionally, DiSMEC was introduced as another extention to PDSparse that allows for distributed and parallel training mechanism which can be done efficiently by a novel negative sampling technique in addition to the pruning of spurious weights. Another approach is to simply apply a dimensional reduction technique on the label space. However doing so have shown to have serious negative effects on the prediction accuracy due to loss of information in the compression phase. Finally, another approach is to use a tree to partition labels into groups based on similarity. This approach have shown to be quite fast but unfortunately, due to the problems with BOW methods, the accuracy is poor.<br />
<br />
=== Deep Learning Approaches ===<br />
Unlike BOW approaches, deep learning can learn dense representations of the corpus using context and semantics. One such example is X-BERT[7] which divides XTMC problems into 3 steps. First, it partitions the label set into clusters based on similarity. Next, it fits a BERT model on the label clusters for the given corpus. Finally, it trains simple linear classifiers to rank the labels in each cluster. Since there are elements of traditional techniques in X-BERT, namely, the clustering step and the linear classifier step, the authors propose to improve upon this approach.<br />
<br />
== APLC-XLNet ==<br />
APLC-XLNet consists of three parts: the pretrained XLNet-Base as the base, the APLC output layer and a fully connected hidden layer connecting the pool layer of XLNet the output layer, as it can be seen in figure 1. One major challenge in XTMC problems is that most data fall into a small group of labels. To tackle this challenge, the authors propose partitioning the label set into one head cluster, <math> V_h </math>, and many tail clusters, <math> V_1 \cup \ldots \cup V_K </math>. The head cluster contains the most popular labels while the tail clusters contains the rest of the labels. The clusters are then inserted in a 2-level tree where the root node is the head cluster and the leaves are the tail clusters. Using this architecture improves computation times significantly since most of the time the data stops at the root node.<br />
<br />
[[File:Capture1111.JPG |center|600px]]<br />
<br />
<div align="center">Figure 1: Architecture of the proposed APLC-XLNet model. V denotes the label cluster in APLC.</div><br />
<br />
The authors define the probability of each label as follows:<br />
<br />
\begin{equation}<br />
p(y_{ij} | x) = <br />
\begin{cases} <br />
p(y_{ij}|x) & \text{if } y_{ij} \in V_h \\<br />
p(V_t|x)p(y_{ij}|V_t,x) & \text{if } y_{ij} \in V_t<br />
\end{cases}<br />
\end{equation}<br />
<br />
where <math> x </math> is the feature of a given sample, <math> y_{ij} </math> is the j-th label in the i-th sample, and <math> V_t </math> is the t-th tail cluster. Let <math> Y_i </math> be the set of labels for the i-th sample, and define <math> L_i = |Y_i| </math>. The authors propose an intuitive objective loss function for multi-label classification:<br />
<br />
\begin{equation}<br />
J(\theta) = -\frac{1}{\sum_{i=1}^N L_i} \sum_{i=1}^N \sum_{j \in Y_i} (y_{ij} logp(y_{ij}) + (1 - y_{ij}) log(1- p(y_{ij}))<br />
\end{equation}<br />
where <math>N</math> is the number of samples, <math>p(y_{ij})</math> is defined above and <math> y_{ij} \in \{0, 1\} </math>.<br />
<br />
The number of parameters in this model is given by:<br />
\begin{equation}<br />
N_{par} = d(l_h + K) + \sum_{i=1}^K \frac{d}{q^i}(d+l_i)<br />
\end{equation}<br />
where <math> d </math> is the dimension of the hidden state of <math> V_h </math>, <math> q </math> is a decay variable, <math> l_h = |V_h| </math> and <math> l_i = |V_i| </math>. Furthermore, the computational cost can be expressed as follows:<br />
\begin{align}<br />
C &= C_h + \sum_{i=1}^K C_i \\<br />
&= O(N_b d(l_h + K)) + O(\sum_{i=1}^K p_i N_b \frac{d}{q^i}(l_i + d))<br />
\end{align}<br />
where <math> N_b </math> is the batch size.<br />
<br />
=== Training APLC-XLNet ===<br />
Training APLC-XLNet essentially boils down to training its three parts. The authors suggest using discriminative fine-tuning method[8] to train the model entirely while assigning different learning rates to each part. Since XLNet is pretrained, the learning rate, <math> \eta_x </math>, should be small, while the output layer is specific to this type of problem so its learning rate, <math> \eta_a </math>, should be large. For the connecting hidden layer, the authors chose a learning rate, <math> \eta_h </math>, such that <math> \eta_x < \eta_h < \eta_a </math>. For each of the learning rates, the authors suggest a slanted triangular learning schedule[8] defined as:<br />
\begin{equation}<br />
\eta =<br />
\begin{cases} <br />
\eta_0 \frac{t}{t_w} & \text{if } t \leq t_w \\<br />
\eta_0 \frac{t_a - t}{t_a - t_w} & \text{if } t > t_w<br />
\end{cases}<br />
\end{equation}<br />
where <math> \eta_0 </math> is the starting learning rate, <math> t </math> is the current step, <math> t_w </math> is the chosen warm-up threshold and <math> t_a </math> is the total number of steps. The objective here is to motivate the model to converge quickly to the suitable space at the beginning and then refine the parameters. Learning rates are first increased linearly, and then decayed gradually according to the strategy.<br />
<br />
== Results ==<br />
The authors tested the APLC-XLNet model in several benchmark datasets against current state-of-the-art models. The evaluation metric, P@k is defined as:<br />
\begin{equation}<br />
P@k = \frac{1}{k} \sum_{i \in rank_k(\hat{y})} y_i<br />
\end{equation}<br />
where <math> rank_k(\hat{y}) </math> is the top k ranked probability in the prediction vector, <math> \hat{y} </math>.<br />
<br />
[[File:Paper.PNG|1000px|]]<br />
<br />
To help with tuning the model in the number of clusters and different partitions, the authors experimented on two different datasets: EURLex and Wiki10.<br />
<br />
[[File:XTMC2.PNG|1000px|]]<br />
<br />
The three different partitions in the second graph the authors used were (0.7, 0.2, 0.1), (0.33, 0.33, 0.34), and (0.1, 0.2, 0.7) where 3 clusters were fixed.<br />
<br />
== Conclusion ==<br />
The authors have proposed a new deep learning approach to solve the XMTC problem based on XLNet, namely, APLC-XLNet. APLC-XLNet consists of three parts: the pretrained XLNet that takes the input of the text, a connecting hidden layer, and finally an APLC output layer to give the rankings of relevant labels. Their experiments show that APLC-XLNet has better results in several benchmark datasets over the current state-of-the-art models.<br />
<br />
== Critques ==<br />
The authors chose to use the same architecture for every dataset. The model does not achieve state-of-the-art performance on the larger datasets. Perhaps, a more complex model in the second part of the model could help achieve better results. The authors also put a lot of effort in explaining the model complexity for APLC-XLNet but does not compare it other state-of-the-art models. A table of model parameters and complexity for each model could be helpful in explaining why their techniques are efficient.<br />
<br />
== References ==<br />
[1] Mikolov, T., Kombrink, S., Burget, L., Cernock ˇ y, J., and `<br />
Khudanpur, S. Extensions of recurrent neural network<br />
language model. In 2011 IEEE International Conference<br />
on Acoustics, Speech and Signal Processing (ICASSP),<br />
pp. 5528–5531. IEEE, 2011. <br />
<br />
[2] Dekel, O. and Shamir, O. Multiclass-multilabel classification with more classes than examples. In Proceedings<br />
of the Thirteenth International Conference on Artificial<br />
Intelligence and Statistics, pp. 137–144, 2010. <br />
<br />
[3] Jain, H., Prabhu, Y., and Varma, M. Extreme multi-label loss<br />
functions for recommendation, tagging, ranking & other<br />
missing label applications. In Proceedings of the 22nd<br />
ACM SIGKDD International Conference on Knowledge<br />
Discovery and Data Mining, pp. 935–944. ACM, 2016. <br />
<br />
[4] Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li,<br />
M., and Lin, J. End-to-end open-domain question answering with BERTserini. In NAACL-HLT (Demonstrations),<br />
2019a. <br />
<br />
[5] Jasinska, K., Dembczynski, K., Busa-Fekete, R.,<br />
Pfannschmidt, K., Klerx, T., and Hullermeier, E. Extreme f-measure maximization using sparse probability<br />
estimates. In International Conference on Machine Learning, pp. 1435–1444, 2016. <br />
<br />
[6] Grave, E., Joulin, A., Cisse, M., J ´ egou, H., et al. Effi- ´<br />
cient softmax approximation for gpus. In Proceedings of<br />
the 34th International Conference on Machine LearningVolume 70, pp. 1302–1310. JMLR.org, 2017 <br />
<br />
[7] Wei-Cheng, C., Hsiang-Fu, Y., Kai, Z., Yiming, Y., and<br />
Inderjit, D. X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder Representations from<br />
Transformers. In NeurIPS Science Meets Engineering of<br />
Deep Learning Workshop, 2019.<br />
<br />
[8] Howard, J. and Ruder, S. Universal language model finetuning for text classification. In Proceedings of the 56th<br />
Annual Meeting of the Association for Computational<br />
Linguistics (Volume 1: Long Papers), pp. 328–339, 2018</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=45989Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-11-23T03:00:15Z<p>Msikarou: /* References */</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods, which can artificially inflate the dataset, become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee the robustness of convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describes the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small number of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the Spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left-hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivatives of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all the PDE terms moved to one side of the equation, the closer that <math display="inline"> f </math> is to zero, the better that the neural network satisfies to PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing for the network to learn from a smaller number of data points than would otherwise be necessary. An example of this method can be seen in the example below including figures 1 and 2.<br />
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples such as when dealing with discrete pictures or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods - a technique for numerical solutions of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full-time step. The number of intermediate points used to predict the end solution is called the stages of the Runge-Kutta method - for example, a method where four intermediate values are approximated is called a four-stage method.. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural netowrk which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> i=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta will need to be inverted and solved for the initial and final cases as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of data points at <math display="inline"> t_n </math>, and <math display="inline"> N_{n+1} </math> is the number of data points at <math display="inline"> t_{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principle difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would need to be first applied to the dataset which would rely on assuming the form of the PDE. Conventional parameter fitted techniques are often sensitive to noisy data, leading to errors in results generated with these fitted parameters. However, with PINNs, this parameter fitting can be done simultaneously with the training of the neural network. This change in procedure allows our parameter fitting to not simply identify the parameters that best fit the data given the PDE, but rather to find the parameters which best describe the data while using the PDE as a regularizer. The neural network training procedure is, in essence, unchanged other than we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of the continuous-time method in action, consider a problem involving Burger's equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burger's equation is known as a challenging problem to solve using conventional methods because of the shock (discontinuity) formation after a sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
Assume that we are given noisy measurements of the solution of Burger's equation scattered across the spatio-temporal domain. Also, assume that we do not know the values of the parameters in Burger's equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we also assume that we are ignorant of the initial conditions and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 data points across the entire spatio-temporal domain (representing a mere 2.0% of the known data). The correct values of the parameters which are used to generate the data points are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also, assume that the value of the solution for each of the known data points is randomly perturbed by up to 1% of its value - making the dataset noisy. This problem is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer and using the Limited-memory BFGS (L-BFGS) optimizer.<br />
<br />
==== L-BFGS optimizer [5] ====<br />
<br />
L-BFGS is an optimization algorithm in the family of quasi-Newton methods and a popular algorithm for parameter estimation in machine learning. The aim in L-BFGS is to minimize f(x) over unconstrained values of the real-vector x where f is a differentiable scalar function. L-BFGS stores only fewer vectors that represent the approximation to the inverse hessian implicitly in comparison to the original BFGS.<br />
<br />
=== Results ===<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the data points selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel is the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the full solution and the correct parameter values in both noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise is shown. <br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider the Burger's equation but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at time <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise is the same as in the continuous case and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimates for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burger's equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One fascinating example of this is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 ( uu_x + vu_y) = -p_x + \lambda_2 (u_xx + u_yy) \\<br />
v_t + \lambda_1 (uv_x + vv_y) = -p_y + \lambda_2 (v_xx + v_yy)<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain. We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math> and using these values as input to our loss function. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. The results of this optimization can be seen in figure 4. Notice again the remarkable accuracy that the PINN can achieve in the predictions of the full solution, parameter values, and pressure field. Interestingly, the predicted pressure field is off by an additive constant. This is not a surprise, as the pressure only appears in the PDEs in a gradient, meaning that it is only determinable up to an additive constant. Nonetheless, the PINN is able to predict its gradient with high accuracy.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
==Critiques==<br />
<br />
Although this paper has presented very interesting results and makes a bridge between machine learning and classical computational physics, some questions are still unanswered. For example, how deep should the neural network be? How much data is needed? Why the optimizer is not suffering from being trapped at local optima for the parameters of the differential operators ? Can weight initialization and data normalization be improved? Why these methods seem to be very robust to noise in data? How can uncertainty in predictions be interpreted which hints us to the concept of interpretable AI. The answers to these questions can be next steps for this research direction.<br />
<br />
In this paper, a Quasi-Newton optimizer has been used to update parameters. Although they are more powerful that second order optimizers, however, due to their computational load, they are not the common choice in today's deep learning packages. Considering this, do the first order optimizers handle updating the weights in such a problem? Or they may get stuck in local minima? There is no such experiment in the paper.<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximator neural network that utilizes existing information on physical systems in order to train using a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. It allows for the prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. This procedure can be carried out for different types of data - most notably for continuous-time and discrete-time data, both of which are common in real-world applications.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have received many citations for their work with PINN. In fact, they have recently patented their method in the United States [3].<br />
<br />
The code used to implement PINNs and generate the figures is all freely available on GitHub [4]. It is quite easy to go through and learn - although unfortunately, it is written in TensorFlow v1.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).<br />
<br />
[3] https://patents.google.com/patent/US20200293594A1/en<br />
<br />
[4] https://github.com/maziarraissi/PINNs<br />
<br />
[5] Liu, Dong C., and Jorge Nocedal. "On the limited memory BFGS method for large scale optimization." Mathematical programming 45.1-3 (1989): 503-528.</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=45986Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-11-23T02:59:59Z<p>Msikarou: /* L-BFGS optimizer */</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods, which can artificially inflate the dataset, become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee the robustness of convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describes the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small number of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the Spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left-hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivatives of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all the PDE terms moved to one side of the equation, the closer that <math display="inline"> f </math> is to zero, the better that the neural network satisfies to PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing for the network to learn from a smaller number of data points than would otherwise be necessary. An example of this method can be seen in the example below including figures 1 and 2.<br />
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples such as when dealing with discrete pictures or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods - a technique for numerical solutions of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full-time step. The number of intermediate points used to predict the end solution is called the stages of the Runge-Kutta method - for example, a method where four intermediate values are approximated is called a four-stage method.. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural netowrk which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> i=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta will need to be inverted and solved for the initial and final cases as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of data points at <math display="inline"> t_n </math>, and <math display="inline"> N_{n+1} </math> is the number of data points at <math display="inline"> t_{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principle difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would need to be first applied to the dataset which would rely on assuming the form of the PDE. Conventional parameter fitted techniques are often sensitive to noisy data, leading to errors in results generated with these fitted parameters. However, with PINNs, this parameter fitting can be done simultaneously with the training of the neural network. This change in procedure allows our parameter fitting to not simply identify the parameters that best fit the data given the PDE, but rather to find the parameters which best describe the data while using the PDE as a regularizer. The neural network training procedure is, in essence, unchanged other than we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of the continuous-time method in action, consider a problem involving Burger's equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burger's equation is known as a challenging problem to solve using conventional methods because of the shock (discontinuity) formation after a sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
Assume that we are given noisy measurements of the solution of Burger's equation scattered across the spatio-temporal domain. Also, assume that we do not know the values of the parameters in Burger's equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we also assume that we are ignorant of the initial conditions and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 data points across the entire spatio-temporal domain (representing a mere 2.0% of the known data). The correct values of the parameters which are used to generate the data points are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also, assume that the value of the solution for each of the known data points is randomly perturbed by up to 1% of its value - making the dataset noisy. This problem is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer and using the Limited-memory BFGS (L-BFGS) optimizer.<br />
<br />
==== L-BFGS optimizer [5] ====<br />
<br />
L-BFGS is an optimization algorithm in the family of quasi-Newton methods and a popular algorithm for parameter estimation in machine learning. The aim in L-BFGS is to minimize f(x) over unconstrained values of the real-vector x where f is a differentiable scalar function. L-BFGS stores only fewer vectors that represent the approximation to the inverse hessian implicitly in comparison to the original BFGS.<br />
<br />
=== Results ===<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the data points selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel is the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the full solution and the correct parameter values in both noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise is shown. <br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider the Burger's equation but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at time <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise is the same as in the continuous case and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimates for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burger's equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One fascinating example of this is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 ( uu_x + vu_y) = -p_x + \lambda_2 (u_xx + u_yy) \\<br />
v_t + \lambda_1 (uv_x + vv_y) = -p_y + \lambda_2 (v_xx + v_yy)<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain. We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math> and using these values as input to our loss function. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. The results of this optimization can be seen in figure 4. Notice again the remarkable accuracy that the PINN can achieve in the predictions of the full solution, parameter values, and pressure field. Interestingly, the predicted pressure field is off by an additive constant. This is not a surprise, as the pressure only appears in the PDEs in a gradient, meaning that it is only determinable up to an additive constant. Nonetheless, the PINN is able to predict its gradient with high accuracy.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
==Critiques==<br />
<br />
Although this paper has presented very interesting results and makes a bridge between machine learning and classical computational physics, some questions are still unanswered. For example, how deep should the neural network be? How much data is needed? Why the optimizer is not suffering from being trapped at local optima for the parameters of the differential operators ? Can weight initialization and data normalization be improved? Why these methods seem to be very robust to noise in data? How can uncertainty in predictions be interpreted which hints us to the concept of interpretable AI. The answers to these questions can be next steps for this research direction.<br />
<br />
In this paper, a Quasi-Newton optimizer has been used to update parameters. Although they are more powerful that second order optimizers, however, due to their computational load, they are not the common choice in today's deep learning packages. Considering this, do the first order optimizers handle updating the weights in such a problem? Or they may get stuck in local minima? There is no such experiment in the paper.<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximator neural network that utilizes existing information on physical systems in order to train using a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. It allows for the prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. This procedure can be carried out for different types of data - most notably for continuous-time and discrete-time data, both of which are common in real-world applications.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have received many citations for their work with PINN. In fact, they have recently patented their method in the United States [3].<br />
<br />
The code used to implement PINNs and generate the figures is all freely available on GitHub [4]. It is quite easy to go through and learn - although unfortunately, it is written in TensorFlow v1.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).<br />
<br />
[3] https://patents.google.com/patent/US20200293594A1/en<br />
<br />
[4] https://github.com/maziarraissi/PINNs</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=45954Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-11-23T01:20:37Z<p>Msikarou: /* L-BFGS optimizer */</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods, which can artificially inflate the dataset, become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee the robustness of convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describes the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small number of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the Spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left-hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivatives of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all the PDE terms moved to one side of the equation, the closer that <math display="inline"> f </math> is to zero, the better that the neural network satisfies to PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing for the network to learn from a smaller number of data points than would otherwise be necessary. An example of this method can be seen in the example below including figures 1 and 2.<br />
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples such as when dealing with discrete pictures or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods - a technique for numerical solutions of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full-time step. The number of intermediate points used to predict the end solution is called the stages of the Runge-Kutta method - for example, a method where four intermediate values are approximated is called a four-stage method.. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural netowrk which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> i=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta will need to be inverted and solved for the initial and final cases as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of data points at <math display="inline"> t_n </math>, and <math display="inline"> N_{n+1} </math> is the number of data points at <math display="inline"> t_{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principle difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would need to be first applied to the dataset which would rely on assuming the form of the PDE. Conventional parameter fitted techniques are often sensitive to noisy data, leading to errors in results generated with these fitted parameters. However, with PINNs, this parameter fitting can be done simultaneously with the training of the neural network. This change in procedure allows our parameter fitting to not simply identify the parameters that best fit the data given the PDE, but rather to find the parameters which best describe the data while using the PDE as a regularizer. The neural network training procedure is, in essence, unchanged other than we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of the continuous-time method in action, consider a problem involving Burger's equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burger's equation is known as a challenging problem to solve using conventional methods because of the shock (discontinuity) formation after a sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
Assume that we are given noisy measurements of the solution of Burger's equation scattered across the spatio-temporal domain. Also, assume that we do not know the values of the parameters in Burger's equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we also assume that we are ignorant of the initial conditions and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 data points across the entire spatio-temporal domain (representing a mere 2.0% of the known data). The correct values of the parameters which are used to generate the data points are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also, assume that the value of the solution for each of the known data points is randomly perturbed by up to 1% of its value - making the dataset noisy. This problem is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer and using the Limited-memory BFGS (L-BFGS) optimizer.<br />
<br />
==== L-BFGS optimizer ====<br />
<br />
L-BFGS is an optimization algorithm in the family of quasi-Newton methods and a popular algorithm for parameter estimation in machine learning. The aim in L-BFGS is to minimize f(x) over unconstrained values of the real-vector x where f is a differentiable scalar function. L-BFGS stores only fewer vectors that represent the approximation to the inverse hessian implicitly in comparison to the original BFGS.<br />
<br />
=== Results ===<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the data points selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel is the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the full solution and the correct parameter values in both noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise is shown. <br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider the Burger's equation but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at time <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise is the same as in the continuous case and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimates for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burger's equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One fascinating example of this is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 ( uu_x + vu_y) = -p_x + \lambda_2 (u_xx + u_yy) \\<br />
v_t + \lambda_1 (uv_x + vv_y) = -p_y + \lambda_2 (v_xx + v_yy)<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain. We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math> and using these values as input to our loss function. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. The results of this optimization can be seen in figure 4. Notice again the remarkable accuracy that the PINN can achieve in the predictions of the full solution, parameter values, and pressure field. Interestingly, the predicted pressure field is off by an additive constant. This is not a surprise, as the pressure only appears in the PDEs in a gradient, meaning that it is only determinable up to an additive constant. Nonetheless, the PINN is able to predict its gradient with high accuracy.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
==Critiques==<br />
<br />
Although this paper has presented very interesting results and makes a bridge between machine learning and classical computational physics, some questions are still unanswered. For example, how deep should the neural network be? How much data is needed? Why the optimizer is not suffering from being trapped at local optima for the parameters of the differential operators ? Can weight initialization and data normalization be improved? Why these methods seem to be very robust to noise in data? How can uncertainty in predictions be interpreted which hints us to the concept of interpretable AI. The answers to these questions can be next steps for this research direction.<br />
<br />
In this paper, a Quasi-Newton optimizer has been used to update parameters. Although they are more powerful that second order optimizers, however, due to their computational load, they are not the common choice in today's deep learning packages. Considering this, do the first order optimizers handle updating the weights in such a problem? Or they may get stuck in local minima? There is no such experiment in the paper.<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximator neural network that utilizes existing information on physical systems in order to train using a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. It allows for the prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. This procedure can be carried out for different types of data - most notably for continuous-time and discrete-time data, both of which are common in real-world applications.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have received many citations for their work with PINN. In fact, they have recently patented their method in the United States [3].<br />
<br />
The code used to implement PINNs and generate the figures is all freely available on GitHub [4]. It is quite easy to go through and learn - although unfortunately, it is written in TensorFlow v1.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).<br />
<br />
[3] https://patents.google.com/patent/US20200293594A1/en<br />
<br />
[4] https://github.com/maziarraissi/PINNs</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=45952Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-11-23T01:20:13Z<p>Msikarou: /* L-BFGS optimizer */</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods, which can artificially inflate the dataset, become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee the robustness of convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describes the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small number of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the Spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left-hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivatives of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all the PDE terms moved to one side of the equation, the closer that <math display="inline"> f </math> is to zero, the better that the neural network satisfies to PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing for the network to learn from a smaller number of data points than would otherwise be necessary. An example of this method can be seen in the example below including figures 1 and 2.<br />
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples such as when dealing with discrete pictures or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods - a technique for numerical solutions of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full-time step. The number of intermediate points used to predict the end solution is called the stages of the Runge-Kutta method - for example, a method where four intermediate values are approximated is called a four-stage method.. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural netowrk which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> i=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta will need to be inverted and solved for the initial and final cases as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of data points at <math display="inline"> t_n </math>, and <math display="inline"> N_{n+1} </math> is the number of data points at <math display="inline"> t_{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principle difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would need to be first applied to the dataset which would rely on assuming the form of the PDE. Conventional parameter fitted techniques are often sensitive to noisy data, leading to errors in results generated with these fitted parameters. However, with PINNs, this parameter fitting can be done simultaneously with the training of the neural network. This change in procedure allows our parameter fitting to not simply identify the parameters that best fit the data given the PDE, but rather to find the parameters which best describe the data while using the PDE as a regularizer. The neural network training procedure is, in essence, unchanged other than we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of the continuous-time method in action, consider a problem involving Burger's equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burger's equation is known as a challenging problem to solve using conventional methods because of the shock (discontinuity) formation after a sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
Assume that we are given noisy measurements of the solution of Burger's equation scattered across the spatio-temporal domain. Also, assume that we do not know the values of the parameters in Burger's equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we also assume that we are ignorant of the initial conditions and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 data points across the entire spatio-temporal domain (representing a mere 2.0% of the known data). The correct values of the parameters which are used to generate the data points are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also, assume that the value of the solution for each of the known data points is randomly perturbed by up to 1% of its value - making the dataset noisy. This problem is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer and using the Limited-memory BFGS (L-BFGS) optimizer.<br />
<br />
=== L-BFGS optimizer ===<br />
<br />
L-BFGS is an optimization algorithm in the family of quasi-Newton methods and a popular algorithm for parameter estimation in machine learning. The aim in L-BFGS is to minimize f(x) over unconstrained values of the real-vector x where f is a differentiable scalar function. L-BFGS stores only fewer vectors that represent the approximation to the inverse hessian implicitly in comparison to the original BFGS. <br />
<br />
<br />
<br />
=== Results ===<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the data points selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel is the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the full solution and the correct parameter values in both noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise is shown. <br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider the Burger's equation but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at time <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise is the same as in the continuous case and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimates for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burger's equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One fascinating example of this is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 ( uu_x + vu_y) = -p_x + \lambda_2 (u_xx + u_yy) \\<br />
v_t + \lambda_1 (uv_x + vv_y) = -p_y + \lambda_2 (v_xx + v_yy)<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain. We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math> and using these values as input to our loss function. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. The results of this optimization can be seen in figure 4. Notice again the remarkable accuracy that the PINN can achieve in the predictions of the full solution, parameter values, and pressure field. Interestingly, the predicted pressure field is off by an additive constant. This is not a surprise, as the pressure only appears in the PDEs in a gradient, meaning that it is only determinable up to an additive constant. Nonetheless, the PINN is able to predict its gradient with high accuracy.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
==Critiques==<br />
<br />
Although this paper has presented very interesting results and makes a bridge between machine learning and classical computational physics, some questions are still unanswered. For example, how deep should the neural network be? How much data is needed? Why the optimizer is not suffering from being trapped at local optima for the parameters of the differential operators ? Can weight initialization and data normalization be improved? Why these methods seem to be very robust to noise in data? How can uncertainty in predictions be interpreted which hints us to the concept of interpretable AI. The answers to these questions can be next steps for this research direction.<br />
<br />
In this paper, a Quasi-Newton optimizer has been used to update parameters. Although they are more powerful that second order optimizers, however, due to their computational load, they are not the common choice in today's deep learning packages. Considering this, do the first order optimizers handle updating the weights in such a problem? Or they may get stuck in local minima? There is no such experiment in the paper.<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximator neural network that utilizes existing information on physical systems in order to train using a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. It allows for the prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. This procedure can be carried out for different types of data - most notably for continuous-time and discrete-time data, both of which are common in real-world applications.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have received many citations for their work with PINN. In fact, they have recently patented their method in the United States [3].<br />
<br />
The code used to implement PINNs and generate the figures is all freely available on GitHub [4]. It is quite easy to go through and learn - although unfortunately, it is written in TensorFlow v1.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).<br />
<br />
[3] https://patents.google.com/patent/US20200293594A1/en<br />
<br />
[4] https://github.com/maziarraissi/PINNs</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=45951Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-11-23T01:18:43Z<p>Msikarou: /* Continuous-Time Example */</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods, which can artificially inflate the dataset, become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee the robustness of convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describes the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small number of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the Spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left-hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivatives of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all the PDE terms moved to one side of the equation, the closer that <math display="inline"> f </math> is to zero, the better that the neural network satisfies to PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing for the network to learn from a smaller number of data points than would otherwise be necessary. An example of this method can be seen in the example below including figures 1 and 2.<br />
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples such as when dealing with discrete pictures or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods - a technique for numerical solutions of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full-time step. The number of intermediate points used to predict the end solution is called the stages of the Runge-Kutta method - for example, a method where four intermediate values are approximated is called a four-stage method.. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural netowrk which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> i=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta will need to be inverted and solved for the initial and final cases as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of data points at <math display="inline"> t_n </math>, and <math display="inline"> N_{n+1} </math> is the number of data points at <math display="inline"> t_{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principle difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would need to be first applied to the dataset which would rely on assuming the form of the PDE. Conventional parameter fitted techniques are often sensitive to noisy data, leading to errors in results generated with these fitted parameters. However, with PINNs, this parameter fitting can be done simultaneously with the training of the neural network. This change in procedure allows our parameter fitting to not simply identify the parameters that best fit the data given the PDE, but rather to find the parameters which best describe the data while using the PDE as a regularizer. The neural network training procedure is, in essence, unchanged other than we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of the continuous-time method in action, consider a problem involving Burger's equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burger's equation is known as a challenging problem to solve using conventional methods because of the shock (discontinuity) formation after a sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
Assume that we are given noisy measurements of the solution of Burger's equation scattered across the spatio-temporal domain. Also, assume that we do not know the values of the parameters in Burger's equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we also assume that we are ignorant of the initial conditions and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 data points across the entire spatio-temporal domain (representing a mere 2.0% of the known data). The correct values of the parameters which are used to generate the data points are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also, assume that the value of the solution for each of the known data points is randomly perturbed by up to 1% of its value - making the dataset noisy. This problem is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer and using the Limited-memory BFGS (L-BFGS) optimizer.<br />
<br />
=== L-BFGS optimizer ===<br />
<br />
L-BFGS is an optimization algorithm in the family of quasi-Newton methods and a popular algorithm for parameter estimation in machine learning. The aim in L-BFGS is to minimize <math> f(\mathbf {x} )</math> over unconstrained values of the real-vector <math> \mathbf {x} </math> where <math> f</math> is a differentiable scalar function. L-BFGS stores only fewer vectors that represent the approximation to the inverse hessian implicitly in comparison to the original BFGS. <br />
<br />
<br />
<br />
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the data points selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel is the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the full solution and the correct parameter values in both noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise is shown. <br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider the Burger's equation but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at time <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise is the same as in the continuous case and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimates for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burger's equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One fascinating example of this is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 ( uu_x + vu_y) = -p_x + \lambda_2 (u_xx + u_yy) \\<br />
v_t + \lambda_1 (uv_x + vv_y) = -p_y + \lambda_2 (v_xx + v_yy)<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain. We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math> and using these values as input to our loss function. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. The results of this optimization can be seen in figure 4. Notice again the remarkable accuracy that the PINN can achieve in the predictions of the full solution, parameter values, and pressure field. Interestingly, the predicted pressure field is off by an additive constant. This is not a surprise, as the pressure only appears in the PDEs in a gradient, meaning that it is only determinable up to an additive constant. Nonetheless, the PINN is able to predict its gradient with high accuracy.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
==Critiques==<br />
<br />
Although this paper has presented very interesting results and makes a bridge between machine learning and classical computational physics, some questions are still unanswered. For example, how deep should the neural network be? How much data is needed? Why the optimizer is not suffering from being trapped at local optima for the parameters of the differential operators ? Can weight initialization and data normalization be improved? Why these methods seem to be very robust to noise in data? How can uncertainty in predictions be interpreted which hints us to the concept of interpretable AI. The answers to these questions can be next steps for this research direction.<br />
<br />
In this paper, a Quasi-Newton optimizer has been used to update parameters. Although they are more powerful that second order optimizers, however, due to their computational load, they are not the common choice in today's deep learning packages. Considering this, do the first order optimizers handle updating the weights in such a problem? Or they may get stuck in local minima? There is no such experiment in the paper.<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximator neural network that utilizes existing information on physical systems in order to train using a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. It allows for the prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. This procedure can be carried out for different types of data - most notably for continuous-time and discrete-time data, both of which are common in real-world applications.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have received many citations for their work with PINN. In fact, they have recently patented their method in the United States [3].<br />
<br />
The code used to implement PINNs and generate the figures is all freely available on GitHub [4]. It is quite easy to go through and learn - although unfortunately, it is written in TensorFlow v1.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).<br />
<br />
[3] https://patents.google.com/patent/US20200293594A1/en<br />
<br />
[4] https://github.com/maziarraissi/PINNs</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION&diff=45399DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION2020-11-20T06:07:16Z<p>Msikarou: </p>
<hr />
<div>== Presented by == <br />
Bowen You<br />
<br />
== Introduction == <br />
<br />
Reinforcement learning refers to training a neural network to make a series of decisions dependent on a complex, evolving environment. Typically, this is accomplished by 'rewarding' or 'penalizing' the network based on its behaviors over time. Intelligent agents are able to accomplish tasks that may not have been seen in prior experiences. For recent reviews of reinforcement learning, see [3,4]. One way to achieve this is to represent the world based on past experiences. In this paper, the authors propose an agent that learns long-horizon behaviors purely by latent imagination and outperforms previous agents in terms of data efficiency, computation time, and final performance.<br />
<br />
=== Preliminaries ===<br />
<br />
This section aims to define a few key concepts in reinforcement learning. In the typical reinforcement problem, an <b>agent</b> interacts with the <b>environment</b>. The environment is typically defined by a <b>model</b> that may or may not be known. The environment may be characterized by its <b>state</b> <math display="inline"> s \in \mathcal{S}</math>. The agent may choose to take <b>actions</b> <math display="inline"> a \in \mathcal{A}</math> to interact with the environment. Once an action is taken, the environment returns a <b>reward</b> <math display="inline"> r \in \mathcal{R}</math>as feedback.<br />
<br />
The actions an agent decides to take is defined by a <b>policy</b> function <math display="inline"> \pi : \mathcal{S} \to \mathcal{A}</math>. <br />
Additionally we define functions <math display="inline"> V_{\pi} : \mathcal{S} \to \mathbb{R} \in \mathcal{S}</math> and <math display="inline"> Q_{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}</math> to represent the value function and action-value functions of a given policy <math display="inline">\pi</math> respectively.<br />
<br />
Thus the goal is to find an optimal policy <math display="inline">\pi_{*}</math> such that <br />
\[<br />
\pi_{*} = \arg\max_{\pi} V_{\pi}(s) = \arg\max_{\pi} Q_{\pi}(s, a)<br />
\]<br />
<br />
=== Feedback Loop ===<br />
<br />
Given this framework, agents are able to interact with the environment in a sequential fashion, namely a sequence of actions, states, and rewards. Let <math display="inline"> S_t, A_t, R_t</math> denote the state, action, and reward obtained at time <math display="inline"> t = 1, 2, \ldots, T</math>. We call the tuple <math display="inline">(S_t, A_t, R_t)</math> one <b>episode</b>. This can be thought of as a feedback loop or a sequence<br />
\[<br />
S_1, A_1, R_1, S_2, A_2, R_2, \ldots, S_T<br />
\]<br />
<br />
== Motivation ==<br />
<br />
In many problems, the amount of actions an agent is able to take is limited. Then it is difficult to interact with the environment to learn an accurate representation of the world. The proposed method in this paper aims to solve this problem by "imagining" the state and reward that the action will provide. That is, given a state <math display="inline">S_t</math>, the proposed method generates <br />
\[<br />
\hat{A}_t, \hat{R}_t, \hat{S}_{t+1}, \ldots<br />
\]<br />
<br />
By doing this, an agent is able to plan-ahead and perceive a representation of the environment without interacting with it. Once an action is made, the agent is able to update their representation of the world by the actual observation. This is particularly useful in applications where experience is not easily obtained. <br />
<br />
== Dreamer == <br />
<br />
The authors of the paper call their method Dreamer. In a high-level perspective, Dreamer first learns latent dynamics from past experience, then it learns actions and states from imagined trajectories to maximize future action rewards. Finally, it predicts the next action and executes it. This whole process is illustrated below. <br />
<br />
[[File: dreamer_overview.png | 800px]]<br />
<br />
<br />
Let's look at Dreamer in detail. It consists of:<br />
* Representation <math display="inline">p_{\theta}(s_t | s_{t-1}, a_{t-1}, o_{t}) </math><br />
* Transition <math display="inline">q_{\theta}(s_t | s_{t-1}, a_{t-1}) </math><br />
* Reward <math display="inline"> q_{\theta}(r_t | s_t)</math><br />
* Action <math display="inline"> q_{\phi}(a_t | s_t)</math><br />
* Value <math display="inline"> v_{\psi}(s_t)</math><br />
<br />
where <math display="inline"> \theta, \phi, \psi</math> are learned neural network parameters.<br />
<br />
There are three main components to the proposed algorithm:<br />
* Dynamics Learning: Using past experience data, the agent learns to encode observations and actions into latent states and predicts environment rewards. One way to do this is via representation learning.<br />
* Behavior Learning: In the latent space, the agent predicts state values and actions that maximize the future rewards through back-propagation.<br />
* Environment Interaction: The agent encodes the episode to compute the current model state and predict the next action to interact with the environment.<br />
<br />
The proposed algorithm is described below.<br />
<br />
[[File:dreamer.png|frameless|500px|Dreamer algorithm]]<br />
<br />
Notice that there are three neural networks that are trained simultaneously. <br />
The neural networks with parameters <math display="inline"> \theta, \phi, \psi </math> correspond to models of the environment, action and values respectively.<br />
<br />
== Results ==<br />
<br />
The figure below summarizes the performance of Dreamer compared to other state-of-the-art reinforcement learning agents for continuous control tasks. Overall, it achieves the most consistent performance among them. Additionally, while other agents heavily rely on prior experience, Dreamer is able to learn behaviors with minimal interactions with the environment.<br />
<br />
[[File:scores.png|frameless|500px|Comparison of RL-agents against several continuous control tasks]]<br />
<br />
== Conclusion ==<br />
<br />
This paper presented a new algorithm for training reinforcement learning agents with minimal interactions with the environment. The algorithm outperforms many previous algorithms in terms of computation time and overall performance. This has many practical applications as many agents rely on prior experience which may be hard to obtain in the real-world. Although it may be an extreme example, consider a reinforcement learning agent that learns how to perform rare surgeries may not have enough data samples. This paper shows that it is possible to train agents without requiring many prior interactions with the environment.<br />
<br />
== Critique ==<br />
This paper presents an approach that involves learning a latent dynamics model to learn 20 visual control tasks.<br />
<br />
In the model components in Appendix A, they have mentioned that "three dense layers of size 300 with ELU activations" and "30-dimensional diagonal Gaussians" have been used for distributions in latent space. The paper would have benefitted from pointing out how come they have come up with this architecture as their model. In other words, how the latent vector determines the performance of the agent.<br />
<br />
== References ==<br />
<br />
[1] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020.<br />
<br />
[2] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.<br />
<br />
[3] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.<br />
<br />
[4] Nian, R., Liu, J., & Huang, B. (2020). A review On reinforcement learning: Introduction and applications in industrial process control. Computers and Chemical Engineering, 139, 106886.</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization&diff=45383Meta-Learning For Domain Generalization2020-11-20T03:07:35Z<p>Msikarou: /* Critiques */</p>
<hr />
<div>== Presented by ==<br />
Parsa Ashrafi Fashi<br />
<br />
== Introduction ==<br />
<br />
Domain Shift problem addresses the problem where a model trained on a data distribution cannot perform well when tested on another domain with a different distribution. Domain Generalization tries to tackle this problem by producing models that can perform well on unseen target domains. Several approaches have been adapted for the problem, such as training a model for each source domain, extracting a domain agnostic representation, and semantic feature learning. Meta-Learning and specifically Model-Agnostic Meta-Learning models, which have been widely adopted recently, are models capable of adapting or generalizing to new tasks and new environments that have never been encountered during training time. Meta-learning is also known as "learning to learn". It aims to enable intelligent agents to take the principles they learned in one domain and apply them to other domains. One concrete meta-learning task is to create a game bot that can quickly master a new game. Hereby defining tasks as domains, the paper tries to overcome the problem in a model-agnostic way.<br />
<br />
== Previous Work ==<br />
There were 3 common approaches to Domain Generalization. The simplest way is to train a model for each source domain and estimate which model performs better on a new unseen target domain [1]. A second approach is to presume that any domain is composed of a domain-agnostic and a domain-specific component. By factoring out the domain-specific and domain-agnostic components during training on source domains, the domain-agnostic component can be extracted and transferred as a model that is likely to work on a new source domain [2]. Finally, a domain-invariant feature representation is learned to minimize the gap between multiple source domains and it should provide a domain-independent representation that performs well on a new target domain [3][4][5].<br />
<br />
== Method ==<br />
In the DG setting, we assume there are S source domains <math> S </math> and T target domains <math> T </math> . We define a single model parametrized as <math> \theta </math> to solve the specified task. DG aims for training <math> \theta </math> on the source domains, such that it generalizes to the target domains. At each learning iteration we split the original S source domains <math> S </math> into S−V meta-train domains <math> \bar{S} </math> and V meta-test domains <math> \breve{S} </math> (virtual-test domain). This is to mimic real train-test domain-shifts so that over many iterations we can train a model to achieve good generalization in the final-test evaluated on target domains <math>T</math> . <br />
<br />
The paper explains the method based on two approaches; Supervised Learning and Reinforcement Learning.<br />
<br />
=== Supervised Learning ===<br />
<br />
First, <math> l(\hat{y},y) </math> is defined as a cross-entropy loss function. ( <math> l(\hat{y},y) = -\hat{y}log(y) </math>). The process is as follows.<br />
<br />
==== Meta-Train ====<br />
The model is updated on S-V domains <math> \bar{S} </math> and the loss function is defined as: <math> F(.) = \frac{1}{S-V} \sum\limits_{i=1}^{S-V} \frac {1}{N_i} \sum\limits_{j=1}^{N_i} l_{\theta}(\hat{y}_j^{(i)}, y_j^{(i)})</math><br />
<br />
In this step the model is optimized by gradient descent like follows: <math> \theta^{\prime} = \theta - \alpha \nabla_{\theta} </math><br />
<br />
==== Meta-Test ====<br />
<br />
In each mini-batch the model is also virtually evaluated on the V meta-test domains <math>\breve{S}</math>. This meta-test evaluation simulates testing on new domains with different statistics, in order to allow learning to generalize across domains. The loss for the adapted parameters calculated on the meta-test domains is as follows: <math> G(.) = \frac{1}{V} \sum\limits_{i=1}^{V} \frac {1}{N_i} \sum\limits_{j=1}^{N_i} l_{\theta^{\prime}}(\hat{y}_j^{(i)}, y_j^{(i)})</math><br />
<br />
The loss on the meta-test domain is calculated using the updated parameters <math>\theta' </math> from meta-train. This means that for optimization with respect to <math>G </math> we will need the second derivative with respect to <math>\theta </math>. <br />
<br />
==== Final Objective Function ====<br />
<br />
Combining the two loss functions, the final objective function is as follows: <math> argmin_{\theta} \; F(\theta) + \beta G(\theta - \alpha F^{\prime}(\theta)) </math>, where <math>\beta</math> represents how much meta-test weighs. Algorithm 1 illustrates the supervised learning approach. <br />
<br />
[[File:ashraf1.jpg |center|600px]]<br />
<br />
<div align="center">Algorithm 1: MLDG Supervised Learning Approach.</div><br />
<br />
=== Reinforcement Learning ===<br />
<br />
In application to the reinforcement learning (RL) setting, we now assume an agent with a policy <math> \pi </math> that inputs states <math> s </math> and produces actions <math> a </math> in a sequential decision making task: <math>a_t = \pi_{\theta}(s_t)</math>. The agent operates in an environment and its goal is to maximize its discounted return, <math> R = \sum\limits_{t} \delta^t R_t(s_t, a_t) </math> where <math> R_t </math> is the reward obtained at timestep <math> t </math> under policy <math> \pi </math> and <math> \delta </math> is the discount factor. Here, tasks map to return functions and domains map to different environments.<br />
==== Meta-Train ==== <br />
In meta-training, the loss function <math> F(·) </math>now corresponds to the negative discounted return <math> -R </math> of policy <math> \pi_{\theta} </math>, averaged over all the meta-training environments in <math> \bar{S} </math>. That is, <br />
\begin{align}<br />
F = \frac{1}{|\bar{S}|} \sum_{s \in \bar{S}} -R_s<br />
\end{align}<br />
<br />
Then the optimal policy is obtained by minimizing <math> F </math>.<br />
<br />
==== Meta-Test ====<br />
The step is like a meta-test of supervised learning and loss is again negative of return function. For RL calculating this loss requires rolling out the meta-train updated policy <math> \theta' </math> in the meta-test domains to collect new trajectories and rewards. The reinforcement learning approach is also illustrated completely in algorithm 2.<br />
[[File:ashraf2.jpg |center|600px]]<br />
<br />
<div align="center">Algorithm 1: MLDG Reinforcement Learning Approach.</div><br />
<br />
==== Alternative Variants of MLDG ====<br />
The authors propose different variants of MLDG objective function. For example the so-called MLDG-GC is one that normalizes the gradients upon update to compute the cosine similarity. It is given by:<br />
\begin{equation}<br />
\text{argmin}_\theta F(\theta) + \beta G(\theta) - \beta \alpha \frac{F'(\theta) \cdot G'(\theta)}{||F'(\theta)||_2 ||G'(\theta)||_2}.<br />
\end{equation}<br />
<br />
Another one stops the update of the parameters after the meta-train has converged. This intuition gives the following objective function called MLDG-GN:<br />
\begin{equation}<br />
\text{argmin}_\theta F(\theta) - \beta ||G'(\theta) - \alpha F'(\theta)||_2^2<br />
\end{equation}.<br />
<br />
== Experiments ==<br />
<br />
The Proposed method is exploited in 4 different experiment results (2 supervised and 2 reinforcement learning experiments). <br />
<br />
=== Illustrative Synthetic Experiment ===<br />
<br />
In this experiment, nine domains by sampling curved deviations are synthesized from a diagonal line classifier. We treat eight of these as sources for meta-learning and hold out the last for the final test. Fig. 1 shows the nine synthetic domains which are related in form but differ in the details of their decision boundary. The results show that MLDG performs near perfect and the baseline model without considering domains overfits in the bottom left corner.<br />
<br />
[[File:ashraf3.jpg |center|600px]]<br />
<br />
<div align="center">Figure 1: Synthetic experiment illustrating MLDG.</div><br />
<br />
=== Object Detection === <br />
For object detection, the PACS multi-domain recognition benchmark is exploited; a dataset designed for the cross-domain recognition problems. This dataset has 7 categories (‘dog’, ‘elephant’, ‘giraffe’, ‘guitar’, ‘house’, ‘horse’ and ‘person’) and 4 domains of different stylistic depictions (‘Photo’, ‘Art painting’, ‘Cartoon’ and ‘Sketch’). The diverse depiction styles provide a significant domain gap. The Result of the Current approach compared to other approaches is presented in Table 1. The baseline models are D-MTAE[5],Deep-All (Vanilla AlexNet)[2], DSN[6]and AlexNet+TF[2]. On average, the proposed method outperforms other methods. <br />
<br />
[[File:ashraf4.jpg |center|800px]]<br />
<br />
<div align="center">Table 1: Cross-domain recognition accuracy (Multi-class accuracy) on the PACS dataset. Best performance in bold. </div><br />
<br />
=== Cartpole ===<br />
<br />
The objective is to balance a pole upright by moving a cart. The action space is discrete – left or right. The state has four elements: the position and velocity of the cart and the angular position and velocity of the pole. There are two sub-experiments designed. In the first one, the domain factor is varied by changing the pole length. They simulate 9 domains with pole lengths. In the second they vary multiple domain factors – pole length and cart mass. In both experiments, we randomly choose 6 source domains for training and hold out 3 domains for (true) testing. Since the game can last forever, if the pole does not fall, we cap the maximum steps to 200. The result of both experiments is presented in Tables 2 and 3. The baseline methods are RL-All (Trains a single policy by aggregating the reward from all six source domains) RL-Random-Source (trains on a single randomly selected source domain) and RL-undo-bias: Adaptation of the (linear) undo-bias model of [7]. The proposed MLDG outperforms the baselines.<br />
<br />
[[File:ashraf5.jpg |center|800px]]<br />
<br />
<div align="center">Table 2: Cart-Pole RL. Domain generalisation performance across pole length. Average reward testing on 3 held out domains with random lengths. Upper bound: 200. </div><br />
<br />
[[File:ashraf5.jpg |center|800px]]<br />
<br />
<div align="center">Table 3: Cart-Pole RL. Generalization performance across both pole length and cart mass. Return testing on 3 held out domains with random length and mass. Upper bound: 200. </div><br />
<br />
=== Mountain Car ===<br />
<br />
In this classic RL problem, a car is positioned between two mountains, and the agent needs to drive the car so that it can hit the peak of the right mountain. The difficulty of this problem is that the car engine is not strong enough to drive up the right mountain directly. The agent has to figure out a solution of driving up the left mountain to first generate momentum before driving up the right mountain. The state observation in this game consists of two elements: the position and velocity of the car. There are three available actions: drive left, do nothing, and drive right. Here the baselines are the same as Cartpole. The model doesn't outperform the RL-undo-bias but has a close return value. The results are shown in Table 4.<br />
<br />
[[File:ashraf7.jpg |center|800px]]<br />
<br />
<div align="center">Table 4: Domain generalisation performance for mountain car. Failure rate (↓) and reward (↑) on held-out testing domains with random mountain heights. </div><br />
<br />
== Conclusion ==<br />
<br />
This paper proposed a model-agnostic approach to domain generalization. Unlike prior model-based domain generalization approaches, it scales well with the number of domains and it can also be applied to different Neural Network models. Experimental evaluation shows state-of-the-art results on a recent challenging visual recognition benchmark and promising results on multiple classic RL problems.<br />
<br />
== Critiques ==<br />
<br />
I believe that the meta-learning-based approach (MLDG) extending MAML to the domain generalization problem might have some limitation problems. The objective function of MAML is more applicable for fast task adaptation even it can be shown from the presented tasks in the paper. Also, in the generalization, we do not have access to samples from a new domain, so the MAML-like objective might lead to sub-optimal, as it is highly abstracted from the feature representations. In addition to this, it is hard to scale MLDG to deep architectures like Resnet as it requires differentiating through k iterations of optimization updates, which will lead to some limitations, so I would believe it will be more effective in task networks as it is much shallower than the feature networks.<br />
<br />
<br />
Why meta-learning makes the domain generalization to be domain agnostic? <br />
<br />
In the case that we have four domains, do we randomly pick two domains for meta-train and one for meta-test? if affirmative, because we select two domains out of the three for the meta train, it is likely to have similar meta-train domains between episodes, right?<br />
<br />
The paper would have benefited from demonstrating the strength of the MLDG in terms of embedding space in lower dimensions (TSNE, UMAP) for PACS and other datasets. It is unclear how well the algorithm would have performed domain agnostically on these datasets.<br />
<br />
== References ==<br />
<br />
[1]: [Xu et al. 2014] Xu, Z.; Li, W.; Niu, L.; and Xu, D. 2014. Exploiting low-rank structure from latent domains for domain generalization. In ECCV.<br />
<br />
[2]: [Li et al. 2017] Li, D.; Yang, Y.; Song, Y.-Z.; and Hospedales, T. 2017. Deeper, broader, and artier domain generalization. In ICCV.<br />
<br />
[3]: [Muandet, Balduzzi, and Scholkopf 2013] ¨ Muandet, K.; Balduzzi, D.; and Scholkopf, B. 2013. Domain generalization via invariant feature representation. In ICML.<br />
<br />
[4]: [Ganin and Lempitsky 2015] Ganin, Y., and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation. In ICML.<br />
<br />
[5]: [Ghifary et al. 2015] Ghifary, M.; Bastiaan Kleijn, W.; Zhang, M.; and Balduzzi, D. 2015. Domain generalization for object recognition with multi-task autoencoders. In ICCV.<br />
<br />
[6]: [Bousmalis et al. 2016] Bousmalis, K.; Trigeorgis, G.; Silberman, N.; Krishnan, D.; and Erhan, D. 2016. Domain separation networks. In NIPS.<br />
<br />
[7]: [Khosla et al. 2012] Khosla, A.; Zhou, T.; Malisiewicz, T.; Efros, A. A.; and Torralba, A. 2012. Undoing the damage of dataset bias. In ECCV.</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization&diff=45382Meta-Learning For Domain Generalization2020-11-20T03:07:02Z<p>Msikarou: </p>
<hr />
<div>== Presented by ==<br />
Parsa Ashrafi Fashi<br />
<br />
== Introduction ==<br />
<br />
Domain Shift problem addresses the problem where a model trained on a data distribution cannot perform well when tested on another domain with a different distribution. Domain Generalization tries to tackle this problem by producing models that can perform well on unseen target domains. Several approaches have been adapted for the problem, such as training a model for each source domain, extracting a domain agnostic representation, and semantic feature learning. Meta-Learning and specifically Model-Agnostic Meta-Learning models, which have been widely adopted recently, are models capable of adapting or generalizing to new tasks and new environments that have never been encountered during training time. Meta-learning is also known as "learning to learn". It aims to enable intelligent agents to take the principles they learned in one domain and apply them to other domains. One concrete meta-learning task is to create a game bot that can quickly master a new game. Hereby defining tasks as domains, the paper tries to overcome the problem in a model-agnostic way.<br />
<br />
== Previous Work ==<br />
There were 3 common approaches to Domain Generalization. The simplest way is to train a model for each source domain and estimate which model performs better on a new unseen target domain [1]. A second approach is to presume that any domain is composed of a domain-agnostic and a domain-specific component. By factoring out the domain-specific and domain-agnostic components during training on source domains, the domain-agnostic component can be extracted and transferred as a model that is likely to work on a new source domain [2]. Finally, a domain-invariant feature representation is learned to minimize the gap between multiple source domains and it should provide a domain-independent representation that performs well on a new target domain [3][4][5].<br />
<br />
== Method ==<br />
In the DG setting, we assume there are S source domains <math> S </math> and T target domains <math> T </math> . We define a single model parametrized as <math> \theta </math> to solve the specified task. DG aims for training <math> \theta </math> on the source domains, such that it generalizes to the target domains. At each learning iteration we split the original S source domains <math> S </math> into S−V meta-train domains <math> \bar{S} </math> and V meta-test domains <math> \breve{S} </math> (virtual-test domain). This is to mimic real train-test domain-shifts so that over many iterations we can train a model to achieve good generalization in the final-test evaluated on target domains <math>T</math> . <br />
<br />
The paper explains the method based on two approaches; Supervised Learning and Reinforcement Learning.<br />
<br />
=== Supervised Learning ===<br />
<br />
First, <math> l(\hat{y},y) </math> is defined as a cross-entropy loss function. ( <math> l(\hat{y},y) = -\hat{y}log(y) </math>). The process is as follows.<br />
<br />
==== Meta-Train ====<br />
The model is updated on S-V domains <math> \bar{S} </math> and the loss function is defined as: <math> F(.) = \frac{1}{S-V} \sum\limits_{i=1}^{S-V} \frac {1}{N_i} \sum\limits_{j=1}^{N_i} l_{\theta}(\hat{y}_j^{(i)}, y_j^{(i)})</math><br />
<br />
In this step the model is optimized by gradient descent like follows: <math> \theta^{\prime} = \theta - \alpha \nabla_{\theta} </math><br />
<br />
==== Meta-Test ====<br />
<br />
In each mini-batch the model is also virtually evaluated on the V meta-test domains <math>\breve{S}</math>. This meta-test evaluation simulates testing on new domains with different statistics, in order to allow learning to generalize across domains. The loss for the adapted parameters calculated on the meta-test domains is as follows: <math> G(.) = \frac{1}{V} \sum\limits_{i=1}^{V} \frac {1}{N_i} \sum\limits_{j=1}^{N_i} l_{\theta^{\prime}}(\hat{y}_j^{(i)}, y_j^{(i)})</math><br />
<br />
The loss on the meta-test domain is calculated using the updated parameters <math>\theta' </math> from meta-train. This means that for optimization with respect to <math>G </math> we will need the second derivative with respect to <math>\theta </math>. <br />
<br />
==== Final Objective Function ====<br />
<br />
Combining the two loss functions, the final objective function is as follows: <math> argmin_{\theta} \; F(\theta) + \beta G(\theta - \alpha F^{\prime}(\theta)) </math>, where <math>\beta</math> represents how much meta-test weighs. Algorithm 1 illustrates the supervised learning approach. <br />
<br />
[[File:ashraf1.jpg |center|600px]]<br />
<br />
<div align="center">Algorithm 1: MLDG Supervised Learning Approach.</div><br />
<br />
=== Reinforcement Learning ===<br />
<br />
In application to the reinforcement learning (RL) setting, we now assume an agent with a policy <math> \pi </math> that inputs states <math> s </math> and produces actions <math> a </math> in a sequential decision making task: <math>a_t = \pi_{\theta}(s_t)</math>. The agent operates in an environment and its goal is to maximize its discounted return, <math> R = \sum\limits_{t} \delta^t R_t(s_t, a_t) </math> where <math> R_t </math> is the reward obtained at timestep <math> t </math> under policy <math> \pi </math> and <math> \delta </math> is the discount factor. Here, tasks map to return functions and domains map to different environments.<br />
==== Meta-Train ==== <br />
In meta-training, the loss function <math> F(·) </math>now corresponds to the negative discounted return <math> -R </math> of policy <math> \pi_{\theta} </math>, averaged over all the meta-training environments in <math> \bar{S} </math>. That is, <br />
\begin{align}<br />
F = \frac{1}{|\bar{S}|} \sum_{s \in \bar{S}} -R_s<br />
\end{align}<br />
<br />
Then the optimal policy is obtained by minimizing <math> F </math>.<br />
<br />
==== Meta-Test ====<br />
The step is like a meta-test of supervised learning and loss is again negative of return function. For RL calculating this loss requires rolling out the meta-train updated policy <math> \theta' </math> in the meta-test domains to collect new trajectories and rewards. The reinforcement learning approach is also illustrated completely in algorithm 2.<br />
[[File:ashraf2.jpg |center|600px]]<br />
<br />
<div align="center">Algorithm 1: MLDG Reinforcement Learning Approach.</div><br />
<br />
==== Alternative Variants of MLDG ====<br />
The authors propose different variants of MLDG objective function. For example the so-called MLDG-GC is one that normalizes the gradients upon update to compute the cosine similarity. It is given by:<br />
\begin{equation}<br />
\text{argmin}_\theta F(\theta) + \beta G(\theta) - \beta \alpha \frac{F'(\theta) \cdot G'(\theta)}{||F'(\theta)||_2 ||G'(\theta)||_2}.<br />
\end{equation}<br />
<br />
Another one stops the update of the parameters after the meta-train has converged. This intuition gives the following objective function called MLDG-GN:<br />
\begin{equation}<br />
\text{argmin}_\theta F(\theta) - \beta ||G'(\theta) - \alpha F'(\theta)||_2^2<br />
\end{equation}.<br />
<br />
== Experiments ==<br />
<br />
The Proposed method is exploited in 4 different experiment results (2 supervised and 2 reinforcement learning experiments). <br />
<br />
=== Illustrative Synthetic Experiment ===<br />
<br />
In this experiment, nine domains by sampling curved deviations are synthesized from a diagonal line classifier. We treat eight of these as sources for meta-learning and hold out the last for the final test. Fig. 1 shows the nine synthetic domains which are related in form but differ in the details of their decision boundary. The results show that MLDG performs near perfect and the baseline model without considering domains overfits in the bottom left corner.<br />
<br />
[[File:ashraf3.jpg |center|600px]]<br />
<br />
<div align="center">Figure 1: Synthetic experiment illustrating MLDG.</div><br />
<br />
=== Object Detection === <br />
For object detection, the PACS multi-domain recognition benchmark is exploited; a dataset designed for the cross-domain recognition problems. This dataset has 7 categories (‘dog’, ‘elephant’, ‘giraffe’, ‘guitar’, ‘house’, ‘horse’ and ‘person’) and 4 domains of different stylistic depictions (‘Photo’, ‘Art painting’, ‘Cartoon’ and ‘Sketch’). The diverse depiction styles provide a significant domain gap. The Result of the Current approach compared to other approaches is presented in Table 1. The baseline models are D-MTAE[5],Deep-All (Vanilla AlexNet)[2], DSN[6]and AlexNet+TF[2]. On average, the proposed method outperforms other methods. <br />
<br />
[[File:ashraf4.jpg |center|800px]]<br />
<br />
<div align="center">Table 1: Cross-domain recognition accuracy (Multi-class accuracy) on the PACS dataset. Best performance in bold. </div><br />
<br />
=== Cartpole ===<br />
<br />
The objective is to balance a pole upright by moving a cart. The action space is discrete – left or right. The state has four elements: the position and velocity of the cart and the angular position and velocity of the pole. There are two sub-experiments designed. In the first one, the domain factor is varied by changing the pole length. They simulate 9 domains with pole lengths. In the second they vary multiple domain factors – pole length and cart mass. In both experiments, we randomly choose 6 source domains for training and hold out 3 domains for (true) testing. Since the game can last forever, if the pole does not fall, we cap the maximum steps to 200. The result of both experiments is presented in Tables 2 and 3. The baseline methods are RL-All (Trains a single policy by aggregating the reward from all six source domains) RL-Random-Source (trains on a single randomly selected source domain) and RL-undo-bias: Adaptation of the (linear) undo-bias model of [7]. The proposed MLDG outperforms the baselines.<br />
<br />
[[File:ashraf5.jpg |center|800px]]<br />
<br />
<div align="center">Table 2: Cart-Pole RL. Domain generalisation performance across pole length. Average reward testing on 3 held out domains with random lengths. Upper bound: 200. </div><br />
<br />
[[File:ashraf5.jpg |center|800px]]<br />
<br />
<div align="center">Table 3: Cart-Pole RL. Generalization performance across both pole length and cart mass. Return testing on 3 held out domains with random length and mass. Upper bound: 200. </div><br />
<br />
=== Mountain Car ===<br />
<br />
In this classic RL problem, a car is positioned between two mountains, and the agent needs to drive the car so that it can hit the peak of the right mountain. The difficulty of this problem is that the car engine is not strong enough to drive up the right mountain directly. The agent has to figure out a solution of driving up the left mountain to first generate momentum before driving up the right mountain. The state observation in this game consists of two elements: the position and velocity of the car. There are three available actions: drive left, do nothing, and drive right. Here the baselines are the same as Cartpole. The model doesn't outperform the RL-undo-bias but has a close return value. The results are shown in Table 4.<br />
<br />
[[File:ashraf7.jpg |center|800px]]<br />
<br />
<div align="center">Table 4: Domain generalisation performance for mountain car. Failure rate (↓) and reward (↑) on held-out testing domains with random mountain heights. </div><br />
<br />
== Conclusion ==<br />
<br />
This paper proposed a model-agnostic approach to domain generalization. Unlike prior model-based domain generalization approaches, it scales well with the number of domains and it can also be applied to different Neural Network models. Experimental evaluation shows state-of-the-art results on a recent challenging visual recognition benchmark and promising results on multiple classic RL problems.<br />
<br />
== Critiques ==<br />
<br />
I believe that the meta-learning-based approach (MLDG) extending MAML to the domain generalization problem might have some limitation problems. The objective function of MAML is more applicable for fast task adaptation even it can be shown from the presented tasks in the paper. Also, in the generalization, we do not have access to samples from a new domain, so the MAML-like objective might lead to sub-optimal, as it is highly abstracted from the feature representations. In addition to this, it is hard to scale MLDG to deep architectures like Resnet as it requires differentiating through k iterations of optimization updates, which will lead to some limitations, so I would believe it will be more effective in task networks as it is much shallower than the feature networks.<br />
<br />
<br />
1- Why meta-learning makes the domain generalization to be domain agnostic? <br />
2- In the case that we have four domains, do we randomly pick two domains for meta-train and one for meta-test? if affirmative, because we select two domains out of the three for the meta train, it is likely to have similar meta-train domains between episodes, right?<br />
3- The paper would have benefited from demonstrating the strength of the MLDG in terms of embedding space in lower dimensions (TSNE, UMAP) for PACS and other datasets. It is unclear how well the algorithm would have performed domain agnostically on these datasets. <br />
<br />
== References ==<br />
<br />
[1]: [Xu et al. 2014] Xu, Z.; Li, W.; Niu, L.; and Xu, D. 2014. Exploiting low-rank structure from latent domains for domain generalization. In ECCV.<br />
<br />
[2]: [Li et al. 2017] Li, D.; Yang, Y.; Song, Y.-Z.; and Hospedales, T. 2017. Deeper, broader, and artier domain generalization. In ICCV.<br />
<br />
[3]: [Muandet, Balduzzi, and Scholkopf 2013] ¨ Muandet, K.; Balduzzi, D.; and Scholkopf, B. 2013. Domain generalization via invariant feature representation. In ICML.<br />
<br />
[4]: [Ganin and Lempitsky 2015] Ganin, Y., and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation. In ICML.<br />
<br />
[5]: [Ghifary et al. 2015] Ghifary, M.; Bastiaan Kleijn, W.; Zhang, M.; and Balduzzi, D. 2015. Domain generalization for object recognition with multi-task autoencoders. In ICCV.<br />
<br />
[6]: [Bousmalis et al. 2016] Bousmalis, K.; Trigeorgis, G.; Silberman, N.; Krishnan, D.; and Erhan, D. 2016. Domain separation networks. In NIPS.<br />
<br />
[7]: [Khosla et al. 2012] Khosla, A.; Zhou, T.; Malisiewicz, T.; Efros, A. A.; and Torralba, A. 2012. Undoing the damage of dataset bias. In ECCV.</div>Msikarou