http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=N2chatti&feedformat=atomstatwiki - User contributions [US]2022-08-14T03:35:48ZUser contributionsMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=F21-STAT_940-Proposal&diff=49884F21-STAT 940-Proposal2020-12-15T05:00:38Z<p>N2chatti: </p>
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
<br />
<br />
<br />
Project # 1 Group members:<br />
<br />
McWhannel, Pierre<br />
<br />
Yan, Nicole<br />
<br />
Hussein Salamah, Ahmed <br />
<br />
Title: Dense Retrieval for Conversational Information Seeking <br />
<br />
Description:<br />
One of the recognized problems in Information Retrieval (IR) is the conversational search that attracts much attention in form of Conversational Assistants such as Alexa, Siri and Cortana. The users’ needs are the ultimate goal of conversational search systems, in this context the questions are asked sequentially imposing a multi-turn format as the Conversational Information Seeking (CIS) task. TREC Conversational Assistance Track (CAsT) [3] is a multi-turn conversational search task as it contains a large-scale reusable test collection for sequences of conversational queries. The response of this conversational model is not a list of relevant documents, but it is limited to brief response passages with a length of 1 to 3 sentences in length.<br />
<br />
[[File:Screen Shot 2020-10-09 at 1.33.00 PM.png | 300px | Example Queries in CAsT]]<br />
<br />
In [4], the authors focus on improving open domain question answering by including dense representations for retrieval instead of the traditional methods. They have adopted a simple dual-encoder framework to construct a learnable retriever on large collections. We want to adopt this dense representation for the conversational model in the CAsT task and compare it with the performance of the other approaches in literature. The performance will be indicated by using graded relevance on five point, which are Fails to meet, Slightly meets, Moderately meets, Highly meets, and Fully meets.<br />
<br />
We aim to further improve our system performance by integrating the following techniques:<br />
<br />
• Paragraph-level pre-training tasks: ICT, BFS, and WLP [1]<br />
<br />
• ANCE training: periodically using checkpoints to encode documents, from which the strong negatives close to the relevant document would be used as next training negatives [5]<br />
<br />
In summary, this project is exploratory in nature as we will be trying to use state-of-art Dense Passage Retrieval techniques (based on BERT) [4, 6], in a question answering (QA) problem. Current first-stage-retrieval approaches mainly rely on bag-of-words models. In this project, we hope to explore the feasibility of using state-of-art methods such as BERT. We will first compare how these perform on the TREC CAsT datasets [3] against the results retrieved using BM25. After these first points of comparison we will next explore methods of improving DPR by exploring one or more techniques that are made to improve the performance of DPR. [1, 5].<br />
<br />
References<br />
<br />
[1] Wei-Cheng Chang et al. Pre-training Tasks for Embedding-based Large-scale Retrieval. 2020. arXiv: 2002.03932 [cs.LG].<br />
<br />
[2] Zhuyun Dai and Jamie Callan. Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval. 2019. arXiv: 1910.10687 [cs.IR].<br />
<br />
[3] Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. TREC CAsT 2019: The Conversational Assistance Track Overview. 2020. arXiv: 2003.13624 [cs.IR].<br />
<br />
[4] Vladimir Karpukhin et al. Dense Passage Retrieval for Open-Domain Ques- tion Answering. 2020. arXiv: 2004.04906 [cs.CL].<br />
<br />
[5] Lee Xiong et al. Approximate Nearest Neighbor Negative Contrastive Learn- ing for Dense Text Retrieval. 2020. arXiv: 2007.00808 [cs.IR].<br />
<br />
[6] Jingtao Zhan et al. RepBERT: Contextualized Text Embeddings for First- Stage Retrieval. 2020. arXiv: 2006.15498 [cs.IR].<br />
<br />
<br />
<br />
Project # 2 Group members:<br />
<br />
Singh, Gursimran<br />
<br />
Sharma, Govind<br />
<br />
Chanana, Abhinav<br />
<br />
Title: Quick Text Description using Headline Generation and Text To Image Conversion<br />
<br />
Description: An automatic tool to generate short description based on long textual data is a useful mechanism to share quick information. Most of the current approaches involve summarizing the text using varied deep learning approaches from Transformers to different RNNs. For this project, instead of building a standard text summarizer, we aim to provide two separate utilities for generating a quick description of the text. First, we plan to develop a model that produces a headline for the long textual data, and second, we are intending to generate an image describing the text. <br />
<br />
Headline Generation - Headline generation is a specific case of text summarization where the output is generally a combination of few words that gives an overall outcome from the text. In most cases, text summarization is an unsupervised learning problem. But, for the headline generation, we have the original headlines available in our training dataset that makes it a supervised learning task. We plan to experiment with different Recurrent Neural Networks like LSTMs and GRUs with varied architectures. For model evaluation, we are considering BERTScore using which we can compare the reference headline with the automatically generated headline from the model. We also aim to explore Attention and Transformer Networks for the text (headline) generation. We will make use of the currently available techniques mentioned in the various research papers but also try to develop our own architecture if the previous methods don't reveal reliable results on our dataset. Therefore, this task would primarily fit under the category of application of deep learning to a particular domain, but could also include some components of new algorithm design.<br />
<br />
Text to Image Conversion - Generation or synthesis of images from a short text description is another very interesting application domain in deep learning. One approach for image generation is based on mapping image pixels to specific features as described by the discriminative feature representation of the text. Recurrent Neural Networks have been successfully used in learning such feature representations of text. This approach is difficult to generalize because the recognition of discriminative features for texts in different domains is not an easy task and it requires domain expertise. Different generative methods have been used including Variational Recurrent Auto-Encoders and its extension in Deep Recurrent Attention Writer (DRAW). We plan to experiment with Generative Adversarial Networks (GAN). Application of GANs on domain-specific datasets has been done but we aim to apply different variants of GANs on the Microsoft COCO dataset which has been used in other architectures. The analysis will be focusing on how well GANs are able to generalize when compared to other alternatives on the given dataset.<br />
<br />
Scope - The above models will be trained independently on different datasets. Therefore, for a particular text, only one of the two functionalities will be available.<br />
<br />
<br />
<br />
Project # 3 Group members:<br />
<br />
Sikri, Gaurav<br />
<br />
Bhatia, Jaskirat<br />
<br />
Title: Cassava Leaf Disease Classification (Kaggle Competition)<br />
<br />
Description: Cassava is a major food crop harvested by farmers in Africa due to its nature to withstand harsh conditions. But in the last few years, the frequency of viral diseases to leaves has become a major concern for the farmers. Currently-existing methods of disease discovery require farmers to solicit the help of government-funded agricultural experts to visually inspect and diagnose the plants. This is a lot challenging since it is very expensive and labor-intensive. <br />
This task is organized by The Makerere Artificial Intelligence (AI) Lab, which is an AI and Data Science research group based at Makerere University in Uganda. The ask is to predict the type of disease that the leaves have by looking at the image of the leaves. <br />
<br />
Kaggle link:- https://www.kaggle.com/c/cassava-leaf-disease-classification/overview<br />
<br />
<br />
<br />
Project # 4 Group members:<br />
<br />
Maleki, Danial<br />
<br />
Rasoolijaberi, Maral<br />
<br />
Title: Binary Deep Neural Network for the domain of Pathology<br />
<br />
Description: The binary neural network, largely saving the storage and computation, serves as a promising technique for deploying deep models on resource-limited devices. However, the binarization inevitably causes severe information loss, and even worse, its discontinuity brings difficulty to the optimization of the deep network. We want to investigate the possibility of using these types of networks in the domain of histopathology as it has gigapixels images which make the use of them very useful.<br />
<br />
<br />
Project # 5 Group members:<br />
<br />
Jain, Abhinav<br />
<br />
Bathla, Gautam<br />
<br />
Title: Zero short learning with AREN and HUSE<br />
<br />
Description: Attention Region Discovery and Adaptive Thresholding module are taken from the idea of “Attentive Region Embedding Network for Zero-shot Learning” (https://openaccess.thecvf.com/content_CVPR_2019/papers/Xie_Attentive_Region_Embedding_Network_for_Zero-Shot_Learning_CVPR_2019_paper.pdf) whereas the idea for projecting image and text embeddings into a shared space was taken by “HUSE: Hierarchical Universal Semantic Embeddings” (https://arxiv.org/pdf/1911.05978.pdf). The motivation is that the attribute embedding can provide some complementary information to the model which can be learned to represent into a shared space and hence a better prediction to the zero-shot learning can be made. Also, the Squeeze and Excitation layer showed some impressive results when applied to the feature extraction part of the model, therefore we thought of re-weighting the channels first before applying the self-attention module so that the model can give even better attention to the image. The paper “Attentive Region Embedding Network for Zero-shot Learning” projected image features directly to the semantic space to make zero-shot predictions but HUSE showed that models can learn better when image and semantic features are projected to a shared space, therefore we wanted to see if the model can make use of this shared space and hence enhance the performance.<br />
<br />
[[File:a227jain-proposal.jpeg | 300px | Architecture diagram]]<br />
<br />
<br />
Project # 6 Group members:<br />
<br />
You, Bowen<br />
<br />
Wu, Mohan<br />
<br />
Title: Deep Learning Models in Volatility Forecasting<br />
<br />
Description: Price forecasting has become a very hot topic in the financial industry in recent years. We are however very interested in the volatility of such financial instruments. We propose a new deep learning architecture or model to predict volatility and apply our model to real life datasets of various financial products. We will analyze our results and compare them to more traditional methods.<br />
<br />
<br />
Project # 7 Group members:<br />
<br />
Chen, Meixi<br />
<br />
Shen, Wenyu<br />
<br />
Title: Through the Lens of Probability Theory: A Comparison Study of Bayesian Deep Learning Methods<br />
<br />
Description: Deep neural networks have been known as black box models, but they can be made less mysterious when adopting a Bayesian approach. From a Bayesian perspective, one is able to assign uncertainty on the weights instead of having single point estimates, which allows for a better interpretability of deep learning models. However, Bayesian deep learning methods are often intractable due an increase amount of parameters and often times don't have as good performance. In this project, we will study different BDL methods such as Bayesian CNN using variational inference and Laplace approximation, with applications on image classification, and we will try to propose improvements where possible.<br />
<br />
<br />
Project # 8 Group members:<br />
<br />
Avilez, Jose<br />
<br />
Title: A functional universal approximation theorem<br />
<br />
Description: In the seminal paper "Approximation by superpositions of a sigmoidal function", Cybenko gave a simple proof using elementary functional analysis that a certain class of functions, called discriminatory functions, serve as valid activation functions for universal neural approximators. The objective of our project is three-fold:<br />
<br />
1) Prove a converse of Cybenko's Universal Approximation Theorem by means of the Stone-Weierstrass theorem<br />
<br />
2) Provide examples and non-examples of Cybenko's discriminatory functions<br />
<br />
3) Construct a neural network for functional data (i.e. data arising in function spaces) and prove a universal approximation theorem for Lp spaces.<br />
<br />
References:<br />
<br />
[1] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4), 303-314.<br />
<br />
[2] Folland, Gerald B. Real analysis: modern techniques and their applications. Vol. 40. John Wiley & Sons, 1999.<br />
<br />
[3] Ramsay, J. O. (2004). Functional data analysis. Encyclopedia of Statistical Sciences, 4.<br />
<br />
<br />
<br />
Project # 9 Group members:<br />
<br />
Sikaroudi, Milad<br />
<br />
Ashrafi Fashi, Parsa<br />
<br />
Title: '''Magnification Generalization with Model-Agnostic Semantic Features in Histopathology Images'''<br />
<br />
Many of the embedding methods learn the subspace for only a specific magnification. However, one of the main challenges in histopathology image embedding is the different magnification levels for indexing of a Whole Slide Indexing (WSI) image [1]. It is well-known that significantly different patterns may exist at different magnification levels of a WSI [2]. <br />
It is useful to train an embedding space for discriminating the histopathology patches regardless of their magnifications. That would lead to learning more compact WSI representations. It has been an arduous task because of the significant domain shifts between different magnification levels with noticeably different patterns. The performance of conventional deep neural networks tends to degrade in the presence of a domain shift, such as the gathering of data from different centers. In this study for the first time, we are going to introduce different magnification levels as a domain shift to see if we can generalize to in-common features in different magnification levels by means of a domain generalization technique, known as Model Agnostic Learning of Semantic Features. The hypothesis is that the statistics of retrieval for the model trained using episodic domain generalization will not degrade as much as the baseline when there is a domain shift. <br />
<br />
[1] Sellaro, Tiffany L., et al. "Relationship between magnification and resolution in digital pathology systems." Journal of pathology informatics 4 (2013).<br />
<br />
[2] Zaveri, Manit, et al. "Recognizing Magnification Levels in Microscopic Snapshots." arXiv preprint arXiv:2005.03748 (2020).<br />
<br />
<br />
Project # 10 Group members:<br />
<br />
Torabian, Parsa<br />
<br />
Ebrahimi Farsangi, Sina<br />
<br />
Moayyedi, Arash<br />
<br />
Title: Meta-Learning Regularizers for Few-Shot Classification Models<br />
<br />
Our project aims at exploring the effects of self-supervised pre-training on few-shot classification. We draw inspiration from the paper “When Does Self-supervision Improve Few-shot Learning?”[1] where the authors analyse the effects of using the Jigsaw puzzle[2] and rotation tasks as regularizers for training Prototypical Networks[3] and Model-Agnostic Meta-Learning (MAML)[4] networks. <br />
<br />
The introduced paper analyzes the effects of regularizing meta-learning models using self-supervised loss, based on rotation and Jigsaw tasks. We will apply PIRL [5], a contrastive self-supervised approach which recently set records, to the few-shot task. We will also look at other regularization techniques which are successful in classical deep learning regimes such as orthogonality regularization, to see their effects in few-shot learning. We will test our methods against ProtoNets and MAML, as representative of the metric and optimization-based meta-learning approaches respectively. <br />
<br />
References:<br />
<br />
[1] https://arxiv.org/abs/1910.03560<br />
<br />
[2] https://arxiv.org/abs/1603.09246<br />
<br />
[3] https://arxiv.org/abs/1703.05175 <br />
<br />
[4] https://arxiv.org/abs/1703.03400<br />
<br />
[5] https://arxiv.org/abs/1912.01991<br />
<br />
<br />
Project # 11 Group Members:<br />
<br />
Shikhar Sakhuja: s2sakhuj@uwaterloo.ca <br />
<br />
Introduction:<br />
<br />
Controller Area Network (CAN bus) is a vehicle bus standard that allows Electronic Control Units (ECU) within an automobile to communicate with each other without the need for a host computer. Modern automobiles might have up to 70 ECUs for various subsystems such as Engine, Transmission, Breaking, etc. The ECUs exchange messages on the CAN bus and allow for a lot of modern vehicle capabilities such as automatic start/stop, electric park brakes, lane detection, collision avoidance, and more. Each message exchanged on the bus is encoded as a 29-bit packet. These 29 bits consist of a combination of Parameter Group Number (PGN), message priority, and the source address of the message. Parameter groups can be, for example, engine temperature which could include coolant temperature, fuel temperature, etc. The PGN itself includes information such as priority, reserved status, data page, and PDU format. Lastly, the source address maps the message to the ECU it originates from. <br />
<br />
Goals:<br />
<br />
(1) This project aims to use messages exchanged on the CAN bus of a Challenger Truck collected by the Embedded Systems Group at the University of Waterloo. The data exists in a temporal format with a new message exchanged periodically. The goals of this project are two folds:<br />
<br />
(2) Predicting the PGN and source address of message N exchanged on the bus, given messages 1 to N-1. We might also explore predicting attributes within the PGN. <br />
Predicting the delay between messages N-1 and N, given the delay between each pair of consecutive messages leading up to message N-1. <br />
<br />
Potential Approach:<br />
<br />
For the first goal, we intend to experiment with RNN models along with Attention modules since they have shown promising results in text generation/prediction. <br />
<br />
The second goal is more of an investigative problem where we intend to use regression techniques powered by Neural Networks to predict delays between messages N-1 and N.<br />
<br />
<br />
<br />
<br />
<br />
Project # 12 Group members:<br />
<br />
Hemati, Sobhan <br />
<br />
Meaney, Cameron <br />
<br />
Title: Representation learning of gigapixel histopathology images using PointNet a permutation invariant neural network<br />
<br />
Description:<br />
<br />
In recent years, there has been a significant growth in the amount of information available in digital pathology archives. This data is valuable because of its potential uses in research, education, and pathologic diagnosis. As a result, representation learning of histopathology whole slide images (WSIs) has attracted significant attention and become an active area of research. Unfortunately, scientific progress with these data have been difficult because of challenges inherent to the data itself. These challenges include highly complex textures of different tissue types, color variations caused by different stainings, and most notably, the size of the images which are often larger than 50,000x50,000 pixels. Additionally, these images are multi-resolution meaning that each WSI may contain images from different zoom levels, primarily 5X, 10X, 20X, and 40X. With the advent of deep learning, there is optimism that these challenges can be overcome. The main challenge in this approach is that the sheer size of the images makes it infeasible (or impossible) to obtain a vector representation for a WSI, which is a necessary step in order to leverage deep learning algorithms. In practice, this is often bypassed by considering ‘patches’ of the WSI of smaller sizes, a set of which is meant to represent the full WSI. This approach lead to a set representation for a WSI. However, unlike traditional image or sequence models, deep networks that process and learn permutation invariant representations from sets is still a developing area of research. Recent attempts at this include Multi-instance Learning Schemes, Deep Set, and Set Transformers. A particularly successful attempt in developing a deep neural network for set representation in called PointNet which was developed for classification and segmentation of 3D objects and point clouds. In PointNet, each set is represented using a set of (x,y,z) coordinates, and the network is designed to learn a permutation invariant global representation for each set and then use this representation for classification or segmentation.<br />
<br />
In this project, we attempt to first extend the PointNet network to a convolutional PointNet network such that it uses a set of image patches rather than (x,y,z) coordinates to learn the universal permutation invariant representation. Then, we attempt improve the representational power of PointNet as a permutation invariant neural network. For the first part, the main challenge is that while PointNet has been designed for processing of sets with the same size, in WSIs, the size of the image and therefore number of patches is not fixed. For this reason, we will need to develop an idea which enables CNN-PointNet to process sets with different sizes. One possible solution is to use fake members to standardize the set size and then remove the effect of these fake members in backpropagation using a masking scheme. For the second part, the PointNet network can be improved in many ways. For example, the rotation matrix used is not a real rotation matrix as the orthogonality is incorporated using a regularization term. However, using a projected gradient technique and the existence of a closed form solution for obtaining nearest orthogonal matrix to a given matrix (Orthogonal Procrustes Problem) we can keep the exact orthogonality constraint and obtain a real rotation matrix. This exact orthogonality is geometrically important as, otherwise, this transformation will likely corrupt the neighborhood structure of the points in each set. Furthermore, PointNet uses very simple symmetric function (max pooling) as a set approximator, however there more powerful symmetric functions like statistical moments, power-sum with a trainable parameter, and other set approximators can be used. It would be interesting to see how more complicated symmetric functions can improve the representational power of PointNet to achieve more discriminative permutation invariant representations for each set (in this case WSIs).<br />
<br />
Project # 13 Group Members:<br />
<br />
Syed Saad Naseem ssnaseem@uwaterloo.ca<br />
<br />
Title: Text classification of topics related to COVID-19 on social media using deep learning<br />
The COVID-19 pandemic has become a public health emergency and a critical socioeconomic issue worldwide. It is changing the way we live and do business. Social media is a rich source of data about public opinion on different types of topics including topics about COVID-19. I plan on using Reddit to get a dataset of posts and comments from users related to COVID-19 and since Reddit is divided into communities so the posts and comments are also clustered by the topic of the community, for example, posts from the political subreddit will have posts about politics.<br />
<br />
I plan to make a classifier that will take a given text and will tell what the text of talking about for example it can be talking about politics, studies, relationships, etc. The goals of this project are to:<br />
<br />
• Scrape a dataset from Reddit from different communities<br />
<br />
• Train a deep learning model (CNN or RNN model) to classify a given text into the possible categories<br />
<br />
• Test the model on posts from social talking about COVID-19<br />
<br />
<br />
<br />
Project # 14 Group members<br />
<br />
Edwards, John<br />
<br />
Title: Click-through Rate Prediction Using Historical User Data<br />
<br />
Click-through Rate (CTR) prediction consists of forecasting a users probability of clicking on a specified target. CTR is used largely by online advertising systems which sell ad space on a cost-per-click pricing model to asses the likenesses of a user clicking on a targeted ad. <br />
<br />
User session logs provides firms with an assortment of individual specific features, a large - number of which are categorical. Additionally, advertisers posses multiple ad candidates each with their own respective features. The challenge of CTR prediction is to design a model which encompass the Interacting effects of these features to produced high quality forecasts and pair users with advertisements with high potential for click conversion. Additionally computational efficiency must balanced with model complexity so that predictions can be done in an online setting throughout the progression of a users session.<br />
<br />
This projects primary objective will be to attempt creating a new Deep Neural Network (DNN) architecture for producing high quality CTR forecasts while also satisfying the aforementioned challenges.<br />
<br />
While many variants of DNN for CTR predictions exists they can differ greatly in application setting. Specifically, the vast majority of models evaluate each user-ad interaction independently. They fail to utlise information contained for each specific users’ historical add impressions. There is only a small subset of models [1,2,4] which have tried to address this by adapting architectures to utilize historical information. This projects focus will be within this application setting exploring new architectures which can better utilise information contained within a users historical behaviour. <br />
<br />
This projects implementation will consist of the following action plan:<br />
Develop a new model architecture inspired by innovations of previous CTR network designs which lacked the ability to adapt their model to utlize a users historical data [4,5].<br />
Use the public benchmark Avito advertising dataset to empirically evaluate the new models performance and compare it against previous state of the art models for this data set. <br />
<br />
References:<br />
<br />
[1] Ouyang, Wentao & Zhang, Xiuwu & Ren, Shukui & Li, Li & Liu, Zhaojie & Du, Yanlong. (2019). Click-Through Rate Prediction with the User Memory Network. <br />
<br />
[2] Ouyang, Wentao & Zhang, Xiuwu & Li, Li & Zou, Heng & Xing, Xin & Liu, Zhaojie & Du, Yanlong. (2019). Deep Spatio-Temporal Neural Networks for Click-Through Rate Prediction. 2078-2086. 10.1145/3292500.3330655. <br />
<br />
[3] Ouyang, Wentao & Zhang, Xiuwu & Ren, Shukui & Qi, Chao & Liu, Zhaojie & Du, Yanlong. (2019). Representation Learning-Assisted Click-Through Rate Prediction. 4561-4567. 10.24963/ijcai.2019/634. <br />
<br />
[4] Li, Zeyu, Wei Cheng, Yang Chen, H. Chen and W. Wang. “Interpretable Click-Through Rate Prediction through Hierarchical Attention.” Proceedings of the 13th International Conference on Web Search and Data Mining (2020)<br />
<br />
[5] Zhou, Guorui & Gai, Kun & Zhu, Xiaoqiang & Song, Chenru & Fan, Ying & Zhu, Han & Ma, Xiao & Yan, Yanghui & Jin, Junqi & Li, Han. (2018). Deep Interest Network for Click-Through Rate Prediction. 1059-1068. 10.1145/3219819.3219823.<br />
<br />
<br />
Project #15 Group members<br />
<br />
Donya Hamzeian<br />
Maziar Dadbin<br />
<br />
Title: Mechanisms of Action (MoA) Prediction(Kaggle project)<br />
<br />
<br />
Description: This project is organized by the Laboratory for Innovation Science at Harvard with the goal of advancing drug development through improvements to MoA prediction algorithms. <br />
<br />
What is the Mechanism of Action (MoA) of a drug? And why is it important?<br />
<br />
In the past, scientists derived drugs from natural products or were inspired by traditional remedies. Very common drugs, such as paracetamol, known in the US as acetaminophen, were put into clinical use decades before the biological mechanisms driving their pharmacological activities were understood. Today, with the advent of more powerful technologies, drug discovery has changed from the serendipitous approaches of the past to a more targeted model based on an understanding of the underlying biological mechanism of a disease. In this new framework, scientists seek to identify a protein target associated with a disease and develop a molecule that can modulate that protein target. As a shorthand to describe the biological activity of a given molecule, scientists assign a label referred to as mechanism-of-action or MoA for short.<br />
<br />
Here is a brief explanation of how the MoA of a drug is determined:<br />
The approach used in this project to determine the MoA is by testing the drug on a sample of human cells, including control and treatment groups, with different doses. Then, after a certain period of time, their cellular responses, i.e. gene expressions and cell viability, were measured which are deterministic of the drug's mechanisms of action and comprise the feature. The mechanisms of action data, including 206 columns, are binary targets that in this project we aim to predict. Therefore, the whole project is a multilabel binary classification. The data used for training consists of cellular responses to different combinations of drugs and human cell samples with their MoA's which are the columns to be predicted in the test data.<br />
<br />
<br />
Project # 16 <br />
<br />
Group members:<br />
<br />
Nouha Chatti<br />
<br />
Title: Detecting contradiction and entailment in multilingual text<br />
<br />
Description:<br />
The project addresses a Natural Language Inference (NLI) problem. NLI is a task of Natural Language Processing which recognizes textual contradiction and entailment. This project takes place in the framework of a Kaggle competition. The goal consists in implementing a multilingual model able to accurately classify a hypothesis to the following classes: entailment, contradiction or neutral, based on a given premise. This solution can be very useful in various domain. For instance, it can be used to detect contradictions and conflicting sentences in text in general or to identify defiance and inconsistency in political statements or even to analyze users’ opinions and customers’ reviews posted on social platforms.</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Time-series_Generative_Adversarial_Networks&diff=49842Time-series Generative Adversarial Networks2020-12-08T05:10:25Z<p>N2chatti: /* References */</p>
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)<br />
<br />
== Introduction ==<br />
A time-series model should not only be good at learning the overall distribution of temporal features within different time points, but it should also be good at capturing the dynamic relationship between the temporal variables across time.<br />
<br />
The popular autoregressive approach in time-series or sequence analysis is generally focused on minimizing the error involved in multi-step sampling improving the temporal dynamics of data <sup>[1]</sup>. In this approach, the distribution of sequences is broken down into a product of conditional probabilities. The deterministic nature of this approach works well for forecasting but it is not very promising in a generative setup. The GAN approach when applied on time-series directly simply tries to learn <math>p(X|t)</math> using generator and discriminator setup but this fails to leverage the prior probabilities like in the case of the autoregressive models. GANs are being used to generate time-series data in different sectors such as healthcare, energy and finance. Researchers have conditioned them for various use cases, from handling irregular sampling <sup>[13]</sup> to building probabilistic forecasting models for univariate time-series.<sup>[14]</sup><br />
<br />
This paper proposes a novel GAN architecture that combines the two approaches (unsupervised GANs and supervised autoregressive) that allow a generative model to have the ability to preserve temporal dynamics along with learning the overall distribution. This mechanism has been termed as '''Time-series Generative Adversarial Network''' or '''TimeGAN'''. To incorporate supervised learning of data into the GAN architecture, this approach makes use of an embedding network that provides a reversible mapping between the temporal features and their latent representations. The key insight of this paper is that the embedding network is trained in parallel with the generator/discriminator network.<br />
<br />
This approach leverages the flexibility of GANs along with the control of the autoregressive model resulting in significant improvements in the generation of realistic time-series.<br />
<br />
== Related Work ==<br />
The TimeGAN mechanism combines ideas from different research threads in time-series analysis.<br />
<br />
Due to differences between closed-loop training (ground truth conditioned) and open-loop inference (the previous guess conditioned), there can be significant prediction error in multi-step sampling in autoregressive recurrent networks <sup>[2]</sup>. Different methods have been proposed to remedy this including Scheduled Sampling <sup>[1]</sup>, based on curriculum learning <sup>[2]</sup>, where the models are trained to output based on a combination of ground truth and previous outputs. Another method inspired by adversarial domain adaptation is training an auxiliary discriminator that helps separate free-running and teacher-forced hidden states accelerating convergence<sup>[3][4]</sup>. Approach based on Actor-critic methods <sup>[5]</sup> have also been proposed that condition on target outputs estimating the next-token value that nudges the actor’s free-running predictions <sup>[11]</sup>. While all these proposed methods try to improve step-sampling, they are still inherently deterministic.<br />
<br />
Direct application of GAN architecture on time-series data like C-RNN-GAN or RCGAN <sup>[6]</sup> try to generate the time-series data recurrently sometimes taking the generated output from the previous step as input (like in case of RCGAN) along with the noise vector. Recently, adding time stamp information for conditioning has also been proposed in these setups to handle inconsistent sampling. But these approaches remain very GAN-centric and depend only on the traditional adversarial feedback (fake/real) to learn which is not sufficient to capture the temporal dynamics.<br />
<br />
== Problem Formulation ==<br />
Generally, time-series data can be decomposed into two components: static features (variables that remain constant over the entire time-series, or for a long period of time) and temporal features (variables that changes with respect to time). The paper uses <math>S</math> to denote the static component and <math>X</math> to denote the temporal features. Using this setting, inputs to the model can be thought of as a tuple of <math>(S, X_{1:t})</math> that has a joint distribution <math>p</math>. The objective of a generative model is to learn from training data, an approximation of the original distribution <math>p(S, X)</math> i.e. <math>\hat{p}(S, X)</math>. Along with this joint distribution, another objective is to simultaneously learn the autoregressive decomposition of <math>p(S, X_{1:T}) = p(S)\prod_tp(X_t|S, X_{1:t-1})</math> as well. This gives the following two objective functions.<br />
<br />
<div align="center"><math>min_\hat{p}D\left(p(S, X_{1:T})||\hat{p}(S, X_{1:T})\right)</math>, and </div><br />
<br />
<br />
<div align="center"><math>min_\hat{p}D\left(p(X_t | S, X_{1:t-1})||\hat{p}(X_t | S, X_{1:t-1})\right)</math></div><br />
<br />
== Proposed Architecture ==<br />
Apart from the normal GAN components of sequence generator and sequence discriminator, TimeGAN has two additional elements: an embedding function and a recovery function. As mentioned before, all these components are trained concurrently. Figure 1 shows how these four components are arranged and how the information flows between them during training in TimeGAN.<br />
<br />
<div align="center"> [[File:Architecture_TimeGAN.PNG|Architecture of TimeGAN.]] </div><br />
<div align="center">'''Figure 1'''</div><br />
<br />
=== Embedding and Recovery Functions ===<br />
These functions map between the temporal features and their latent representation. This mapping reduces the dimensionality of the original feature space. Let <math>H_s</math> and <math>H_x</math> denote the latent representations of <math>S</math> and <math>X</math> features in the original space. Therefore, the embedding function has the following form:<br />
<br />
<div align="center"> [[File:embedding_formula.PNG]] </div><br />
<br />
And similarly, the recovery function has the following form:<br />
<br />
<div align="center"> [[File:recovery_formula.PNG]] </div><br />
<br />
In the paper, these functions have been implemented using a recurrent network for '''e''' and a feedforward network for '''r'''. These implementation choices are of course subject to parametrization using any architecture.<br />
<br />
=== Sequence Generator and Discriminator ===<br />
Coming to the conventional GAN components of TimeGAN, there is a sequence generator and a sequence discriminator. But these do not work on the original space, rather the sequence generator uses the random input noise to generate sequences in the latent space. Thus, the generator takes as input the noise vectors <math>Z_s</math>, <math>Z_x</math> and turns them into a latent representation <math>H_s</math> and <math>H_x</math>. This function is implemented using a recurrent network. <br />
<br />
The discriminator takes as input the latent representation from the embedding space and produces its binary classification (synthetic/real). This is implemented using a bidirectional recurrent network with a feedforward output layer.<br />
<br />
=== Architecture Workflow ===<br />
The embedding and recovery functions ought to guarantee an accurate reversible mapping between the feature space and the latent space. After the embedding function turns the original data <math>(s, x_{1:t})</math> into the embedding space i.e. <math>h_s</math>, <math>h_x</math>, the recovery function should be able to reconstruct the original data as accurately as possible from this latent representation. Denoting the reconstructed data by <math>\tilde{s}</math> and <math>\tilde{x}_{1:t}</math>, we get the first objective function of the reconstruction loss:<br />
<br />
<div align="center"> [[File:recovery_loss.PNG]] </div><br />
<br />
The generator component in TimeGAN not only gets the noise vector Z as input but it also gets in autoregressive fashion, its previous output i.e. <math>h_s</math> and <math>h_{1:t}</math> as input as well. The generator uses these inputs to produce the synthetic embeddings. The unsupervised gradients when computed are used to decreasing the likelihood at the generator and increasing it at the discriminator to provide the correct classification of the produced synthetic output. This is the second objective function in the unsupervised loss form.<br />
<br />
<div align="center"> [[File:unsupervised_loss.PNG]] </div><br />
<br />
As mentioned before, TimeGAN does not rely on only the binary feedback from GANs adversarial component i.e. the discriminator. It also incorporates the supervised loss from the embedding and recovery functions into the fold. To ensure that the two segments of TimeGAN interact with each other, the generator is alternatively fed embeddings of actual data instead of its own previous synthetical produced embedding. Maximizing the likelihood of this produces the third objective i.e. the supervised loss:<br />
<br />
<div align="center"> [[File:supervised_loss.PNG]] </div><br />
<br />
=== Optimization ===<br />
The embedding and recovery components of TimeGAN are trained to minimize the Supervised loss and Recovery loss. If <math> \theta_{e} </math> and <math> \theta_{r} </math> denote their parameters, then the paper proposes the following as the optimization problem for these two components:<br />
Formula. <div align="center"> [[File:Paper27_eq1.PNG]] </div><br />
Here <math>\lambda</math> >= 0 is used to regularize (or balance) the two losses. <br />
The other components of TimeGAN i.e. generator and discriminator are trained to minimize the Supervised loss along with Unsupervised loss. This optimization problem is formulated as below:<br />
Formula. <div align="center"> [[File:Paper27_eq2.PNG]] </div> Here <math> \eta >= 0 </math> is used to regularize the two losses.<br />
<br />
== Experiments ==<br />
In the paper, the authors compare TimeGAN with the two most familiar and related variations of traditional GANs applied to time-series i.e. RCGAN and C-RNN-GAN. To make a comparison with autoregressive approaches, the authors use RNNs trained with T-Forcing and P-Forcing. Additionally, performance comparisons are also made with WaveNet <sup>[7]</sup> and its GAN alternative WaveGAN <sup>[8]</sup>. Qualitatively, the generated data is examined in terms of diversity (healthy distribution of sample covering real data), fidelity (samples should be indistinguishable from real data), and usefulness (samples should have the same predictive purposes as real data). The authors' results show that TimeGAN is insensitive to the two parameters <math>\lambda </math> and <math> \eta </math> so they decided to set <math> \lambda = 1 </math> and <math> \eta=10 </math> for all experiments. Compared to normal GAN, TimeGAN does not increase the difficulty in training.<br />
<br />
The following methods are used for benchmarking and evaluation:<br />
<br />
# '''Visualization''': This involves the application of t-SNE and PCA analysis on data (real and synthetic). This is done to compare the distribution of generated data with the real data in 2-D space.<br />
# '''Discriminative Score''': This involves training a post-hoc time-series classification model (an off-the-shelf RNN) to differentiate sequences from generated and original sets. <br />
# '''Predictive Score''': This involves training a post-hoc sequence prediction model to forecast using the generated data and this is evaluated against the real data.<br />
<br />
In the first experiment, the authors used time-series sequences from an autoregressive multivariate Gaussian data defined as <math>x_t=\phi x_{t-1}+n</math>, where <math>n \sim N(0, \sigma 1 + (1-\sigma)I)</math>. Table 1 has the results of this experiment performed by different models. The results clearly show how TimeGAN outperforms other methods in terms of both discriminative and predictive scores. <br />
<br />
<div align="center"> [[File:gtable1.PNG]] </div><br />
<div align="center">'''Table 1'''</div><br />
<br />
Next, the paper has experimented on different types of Time Series Data. Using time-series sequences of varying properties, the paper evaluates the performance of TimeGAN to testify for its ability to generalize over time-series data. The paper uses multiple datasets in different areas like Sines, Stocks, Energy, and Events with different methods(TimeGAN, RCGANM CRANNGANM, and etc), and visualize their performance between original data and synthetic. <br />
<br />
===Sines===<br />
They simulated multivariate sinusoidal sequences of different frequencies η and phases θ, providing continuous-valued, periodic, multivariate data where each feature is independent of others.<br />
<br />
===Stocks===<br />
By contrast, sequences of stock prices are continuous-valued but aperiodic; furthermore, features are correlated with each other. They use the daily historical Google stocks data from 2004 to 2019, including as features the volume and high, low, opening, closing, and adjusted closing prices.<br />
<br />
===Energy===<br />
They consider a dataset characterized by noisy periodicity, higher dimensionality, and correlated features. The UCI Appliances energy prediction dataset consists of multivariate, continuous-valued measurements including numerous temporal features measured at close intervals.<br />
<br />
===Events===<br />
Finally, they considered a dataset characterized by discrete values and irregular time stamps. They used a large private lung cancer pathways dataset consisting of sequences of events and their times, and model both the one-hot encoded sequence of event types and the event timings.<br />
<br />
Figure 2 shows t-SNE/PCA visualization comparison for Sines and Stocks and it is clear from the figure that among all different models, TimeGAN shows the best overlap between generated and original data.<br />
<br />
<div align="center"> [[File:pca.PNG]] </div><br />
<div align="center">'''Figure 2'''</div><br />
<br />
Table 2 shows a comparison of predictive and discriminative scores for different methods across different datasets. And TimeGAN outperforms other methods in both scores indicating a better quality of generated synthetic data across different types of datasets. <br />
<br />
<div align="center"> [[File:gtable2.PNG]] </div><br />
<div align="center">'''Table 2'''</div><br />
<br />
== Source Code ==<br />
<br />
The GitHub repository for the paper is https://github.com/jsyoon0823/TimeGAN .<br />
<br />
== Conclusion ==<br />
We could use traditional models like ARMA, ARIMA, and etc to analyze time-series type data. Also, the longitudinal data could be predicted by the generated additive model, linear mixed affected model to do both the feature selection and independent variable predicting. The author of this paper introduced an advanced architecture to generate new time-series data that combines the flexibility of unsupervised learning, and the control of the quality of the generated data by supervised learning.<br />
<br />
Combining the flexibility of GANs and control over conditional temporal dynamics of autoregressive models, TimeGAN shows significant quantitative and qualitative gains for generated time-series data across different varieties of datasets. <br />
<br />
The authors indicated the potential incorporation of Differential Privacy Frameworks into TimeGAN in the future in order to produce real-time sequences with differential privacy guarantees.<br />
<br />
== Critique ==<br />
The method introduced in this paper is truly a novel one. The idea of enhancing the unsupervised components of a GAN with some supervised element has shown significant jumps in certain evaluations. I think the methods of evaluation used in this paper namely, t-SNE/PCA analysis (visualization), discriminative score, and predictive score; are very appropriate for this sort of analysis where the focus is on multiple things (generative accuracy and conditional dependence) both quantitatively and qualitatively. Other related works <sup>[9]</sup> have also used the same evaluation setup.<br />
<br />
The idea of the synthesized time-series being useful in terms of its predictive ability is good, especially in practice. But I think when the authors set out to create a model that can learn the temporal dynamics between time-series data then there could have been some additional metric that could better evaluate if the underlying temporal relations have been captured by the model or not. I feel the addition of some form of temporal correlation analysis would have added to the completeness of the paper.<br />
<br />
The enhancement of traditional GAN by simply adding an extra loss function to the mix is quite elegant. TimeGAN uses a stepwise supervised loss. The authors have also used very common choices for the various components of the overall TimeGAN network. This leaves a lot of possibilities in this area as many direct and indirect variations of TimeGAN or other architectures inspired by TimeGAN can be developed in a very straightforward manner of hyper-parameterizing the building blocks. <br />
<br />
TimeGAN benefits from merging supervised and unsupervised learning to create their generations while other methods in the literature benefit from learning their conditional input to create its generations. I believe after even considering the supervised and unsupervised learning, the way that the authors introduced temporal embeddings to assist network training is not designed well for anomaly detection (outlier detection) as it is only designed for time series representation learning as discussed in [10].<br />
<br />
The paper certainly proposes a novel approach to analyzing time series data, but there are concerns about the way the model is tested in practice. First, if the data is generated from a <math>VAR(1)</math> model, why would the authors would not use a multi-dimensional auto-ARIMA procedure, or a Box-Jenkins approach, to fit a model to their synthetic dataset. Moreover, as has been studied in the M4 competitions (see e.g. https://www.sciencedirect.com/science/article/pii/S0169207019301128), the ability of complex ML models or deep learning models to beat linear models, in general, is questionable. The theoretical reason for this empirical finding is that the Wold decomposition theorem says that a stationary process can be decomposed into the sum of a deterministic process and linear process, which gives a lot of credence to the ARIMA model. It would be highly beneficial if the authors included the Box-Jenkins benchmark in their experiments as well as testing their model against real data to see if it actually performs well.<br />
<br />
== References ==<br />
<br />
[1] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.<br />
<br />
[2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.<br />
<br />
[3] Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609, 2016.<br />
<br />
[4] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.<br />
<br />
[5] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.<br />
<br />
[6] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.<br />
<br />
[7] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016<br />
<br />
[8] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018<br />
<br />
[9] Hao Ni, L. Szpruch, M. Wiese, S. Liao, Baoren Xiao. Conditional Sig-Wasserstein GANs for Time Series Generation, 2020<br />
<br />
[10] Geiger, Alexander et al. “TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks.” ArXiv abs/2009.07769 (2020): n. pag.<br />
<br />
[11] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.<br />
<br />
[12] Makridakis, Spyros, Evangelos Spiliotis, and Vassilios Assimakopoulos. "The M4 Competition: 100,000 time series and 61 forecasting methods." International Journal of Forecasting 36.1 (2020): 54-74.<br />
<br />
[13] Ramponi, Giorgia and Protopapas, Pavlos and Brambilla, Marco and Janssen, Ryan. (2018) “T-cgan: Conditional generative adversarial network for data augmentation in noisy time series with irregular sampling.” arXiv preprint arXiv:1811.08295.<br />
<br />
[14] Koochali, A., Schichtel, P., Dengel, A., Ahmed, S.: Probabilistic forecasting of sensory data with generative adversarial networks forgan. IEEE Access 7, 63868–63880 (2019)</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Time-series_Generative_Adversarial_Networks&diff=49841Time-series Generative Adversarial Networks2020-12-08T05:06:51Z<p>N2chatti: /* Introduction */</p>
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)<br />
<br />
== Introduction ==<br />
A time-series model should not only be good at learning the overall distribution of temporal features within different time points, but it should also be good at capturing the dynamic relationship between the temporal variables across time.<br />
<br />
The popular autoregressive approach in time-series or sequence analysis is generally focused on minimizing the error involved in multi-step sampling improving the temporal dynamics of data <sup>[1]</sup>. In this approach, the distribution of sequences is broken down into a product of conditional probabilities. The deterministic nature of this approach works well for forecasting but it is not very promising in a generative setup. The GAN approach when applied on time-series directly simply tries to learn <math>p(X|t)</math> using generator and discriminator setup but this fails to leverage the prior probabilities like in the case of the autoregressive models. GANs are being used to generate time-series data in different sectors such as healthcare, energy and finance. Researchers have conditioned them for various use cases, from handling irregular sampling <sup>[13]</sup> to building probabilistic forecasting models for univariate time-series.<sup>[14]</sup><br />
<br />
This paper proposes a novel GAN architecture that combines the two approaches (unsupervised GANs and supervised autoregressive) that allow a generative model to have the ability to preserve temporal dynamics along with learning the overall distribution. This mechanism has been termed as '''Time-series Generative Adversarial Network''' or '''TimeGAN'''. To incorporate supervised learning of data into the GAN architecture, this approach makes use of an embedding network that provides a reversible mapping between the temporal features and their latent representations. The key insight of this paper is that the embedding network is trained in parallel with the generator/discriminator network.<br />
<br />
This approach leverages the flexibility of GANs along with the control of the autoregressive model resulting in significant improvements in the generation of realistic time-series.<br />
<br />
== Related Work ==<br />
The TimeGAN mechanism combines ideas from different research threads in time-series analysis.<br />
<br />
Due to differences between closed-loop training (ground truth conditioned) and open-loop inference (the previous guess conditioned), there can be significant prediction error in multi-step sampling in autoregressive recurrent networks <sup>[2]</sup>. Different methods have been proposed to remedy this including Scheduled Sampling <sup>[1]</sup>, based on curriculum learning <sup>[2]</sup>, where the models are trained to output based on a combination of ground truth and previous outputs. Another method inspired by adversarial domain adaptation is training an auxiliary discriminator that helps separate free-running and teacher-forced hidden states accelerating convergence<sup>[3][4]</sup>. Approach based on Actor-critic methods <sup>[5]</sup> have also been proposed that condition on target outputs estimating the next-token value that nudges the actor’s free-running predictions <sup>[11]</sup>. While all these proposed methods try to improve step-sampling, they are still inherently deterministic.<br />
<br />
Direct application of GAN architecture on time-series data like C-RNN-GAN or RCGAN <sup>[6]</sup> try to generate the time-series data recurrently sometimes taking the generated output from the previous step as input (like in case of RCGAN) along with the noise vector. Recently, adding time stamp information for conditioning has also been proposed in these setups to handle inconsistent sampling. But these approaches remain very GAN-centric and depend only on the traditional adversarial feedback (fake/real) to learn which is not sufficient to capture the temporal dynamics.<br />
<br />
== Problem Formulation ==<br />
Generally, time-series data can be decomposed into two components: static features (variables that remain constant over the entire time-series, or for a long period of time) and temporal features (variables that changes with respect to time). The paper uses <math>S</math> to denote the static component and <math>X</math> to denote the temporal features. Using this setting, inputs to the model can be thought of as a tuple of <math>(S, X_{1:t})</math> that has a joint distribution <math>p</math>. The objective of a generative model is to learn from training data, an approximation of the original distribution <math>p(S, X)</math> i.e. <math>\hat{p}(S, X)</math>. Along with this joint distribution, another objective is to simultaneously learn the autoregressive decomposition of <math>p(S, X_{1:T}) = p(S)\prod_tp(X_t|S, X_{1:t-1})</math> as well. This gives the following two objective functions.<br />
<br />
<div align="center"><math>min_\hat{p}D\left(p(S, X_{1:T})||\hat{p}(S, X_{1:T})\right)</math>, and </div><br />
<br />
<br />
<div align="center"><math>min_\hat{p}D\left(p(X_t | S, X_{1:t-1})||\hat{p}(X_t | S, X_{1:t-1})\right)</math></div><br />
<br />
== Proposed Architecture ==<br />
Apart from the normal GAN components of sequence generator and sequence discriminator, TimeGAN has two additional elements: an embedding function and a recovery function. As mentioned before, all these components are trained concurrently. Figure 1 shows how these four components are arranged and how the information flows between them during training in TimeGAN.<br />
<br />
<div align="center"> [[File:Architecture_TimeGAN.PNG|Architecture of TimeGAN.]] </div><br />
<div align="center">'''Figure 1'''</div><br />
<br />
=== Embedding and Recovery Functions ===<br />
These functions map between the temporal features and their latent representation. This mapping reduces the dimensionality of the original feature space. Let <math>H_s</math> and <math>H_x</math> denote the latent representations of <math>S</math> and <math>X</math> features in the original space. Therefore, the embedding function has the following form:<br />
<br />
<div align="center"> [[File:embedding_formula.PNG]] </div><br />
<br />
And similarly, the recovery function has the following form:<br />
<br />
<div align="center"> [[File:recovery_formula.PNG]] </div><br />
<br />
In the paper, these functions have been implemented using a recurrent network for '''e''' and a feedforward network for '''r'''. These implementation choices are of course subject to parametrization using any architecture.<br />
<br />
=== Sequence Generator and Discriminator ===<br />
Coming to the conventional GAN components of TimeGAN, there is a sequence generator and a sequence discriminator. But these do not work on the original space, rather the sequence generator uses the random input noise to generate sequences in the latent space. Thus, the generator takes as input the noise vectors <math>Z_s</math>, <math>Z_x</math> and turns them into a latent representation <math>H_s</math> and <math>H_x</math>. This function is implemented using a recurrent network. <br />
<br />
The discriminator takes as input the latent representation from the embedding space and produces its binary classification (synthetic/real). This is implemented using a bidirectional recurrent network with a feedforward output layer.<br />
<br />
=== Architecture Workflow ===<br />
The embedding and recovery functions ought to guarantee an accurate reversible mapping between the feature space and the latent space. After the embedding function turns the original data <math>(s, x_{1:t})</math> into the embedding space i.e. <math>h_s</math>, <math>h_x</math>, the recovery function should be able to reconstruct the original data as accurately as possible from this latent representation. Denoting the reconstructed data by <math>\tilde{s}</math> and <math>\tilde{x}_{1:t}</math>, we get the first objective function of the reconstruction loss:<br />
<br />
<div align="center"> [[File:recovery_loss.PNG]] </div><br />
<br />
The generator component in TimeGAN not only gets the noise vector Z as input but it also gets in autoregressive fashion, its previous output i.e. <math>h_s</math> and <math>h_{1:t}</math> as input as well. The generator uses these inputs to produce the synthetic embeddings. The unsupervised gradients when computed are used to decreasing the likelihood at the generator and increasing it at the discriminator to provide the correct classification of the produced synthetic output. This is the second objective function in the unsupervised loss form.<br />
<br />
<div align="center"> [[File:unsupervised_loss.PNG]] </div><br />
<br />
As mentioned before, TimeGAN does not rely on only the binary feedback from GANs adversarial component i.e. the discriminator. It also incorporates the supervised loss from the embedding and recovery functions into the fold. To ensure that the two segments of TimeGAN interact with each other, the generator is alternatively fed embeddings of actual data instead of its own previous synthetical produced embedding. Maximizing the likelihood of this produces the third objective i.e. the supervised loss:<br />
<br />
<div align="center"> [[File:supervised_loss.PNG]] </div><br />
<br />
=== Optimization ===<br />
The embedding and recovery components of TimeGAN are trained to minimize the Supervised loss and Recovery loss. If <math> \theta_{e} </math> and <math> \theta_{r} </math> denote their parameters, then the paper proposes the following as the optimization problem for these two components:<br />
Formula. <div align="center"> [[File:Paper27_eq1.PNG]] </div><br />
Here <math>\lambda</math> >= 0 is used to regularize (or balance) the two losses. <br />
The other components of TimeGAN i.e. generator and discriminator are trained to minimize the Supervised loss along with Unsupervised loss. This optimization problem is formulated as below:<br />
Formula. <div align="center"> [[File:Paper27_eq2.PNG]] </div> Here <math> \eta >= 0 </math> is used to regularize the two losses.<br />
<br />
== Experiments ==<br />
In the paper, the authors compare TimeGAN with the two most familiar and related variations of traditional GANs applied to time-series i.e. RCGAN and C-RNN-GAN. To make a comparison with autoregressive approaches, the authors use RNNs trained with T-Forcing and P-Forcing. Additionally, performance comparisons are also made with WaveNet <sup>[7]</sup> and its GAN alternative WaveGAN <sup>[8]</sup>. Qualitatively, the generated data is examined in terms of diversity (healthy distribution of sample covering real data), fidelity (samples should be indistinguishable from real data), and usefulness (samples should have the same predictive purposes as real data). The authors' results show that TimeGAN is insensitive to the two parameters <math>\lambda </math> and <math> \eta </math> so they decided to set <math> \lambda = 1 </math> and <math> \eta=10 </math> for all experiments. Compared to normal GAN, TimeGAN does not increase the difficulty in training.<br />
<br />
The following methods are used for benchmarking and evaluation:<br />
<br />
# '''Visualization''': This involves the application of t-SNE and PCA analysis on data (real and synthetic). This is done to compare the distribution of generated data with the real data in 2-D space.<br />
# '''Discriminative Score''': This involves training a post-hoc time-series classification model (an off-the-shelf RNN) to differentiate sequences from generated and original sets. <br />
# '''Predictive Score''': This involves training a post-hoc sequence prediction model to forecast using the generated data and this is evaluated against the real data.<br />
<br />
In the first experiment, the authors used time-series sequences from an autoregressive multivariate Gaussian data defined as <math>x_t=\phi x_{t-1}+n</math>, where <math>n \sim N(0, \sigma 1 + (1-\sigma)I)</math>. Table 1 has the results of this experiment performed by different models. The results clearly show how TimeGAN outperforms other methods in terms of both discriminative and predictive scores. <br />
<br />
<div align="center"> [[File:gtable1.PNG]] </div><br />
<div align="center">'''Table 1'''</div><br />
<br />
Next, the paper has experimented on different types of Time Series Data. Using time-series sequences of varying properties, the paper evaluates the performance of TimeGAN to testify for its ability to generalize over time-series data. The paper uses multiple datasets in different areas like Sines, Stocks, Energy, and Events with different methods(TimeGAN, RCGANM CRANNGANM, and etc), and visualize their performance between original data and synthetic. <br />
<br />
===Sines===<br />
They simulated multivariate sinusoidal sequences of different frequencies η and phases θ, providing continuous-valued, periodic, multivariate data where each feature is independent of others.<br />
<br />
===Stocks===<br />
By contrast, sequences of stock prices are continuous-valued but aperiodic; furthermore, features are correlated with each other. They use the daily historical Google stocks data from 2004 to 2019, including as features the volume and high, low, opening, closing, and adjusted closing prices.<br />
<br />
===Energy===<br />
They consider a dataset characterized by noisy periodicity, higher dimensionality, and correlated features. The UCI Appliances energy prediction dataset consists of multivariate, continuous-valued measurements including numerous temporal features measured at close intervals.<br />
<br />
===Events===<br />
Finally, they considered a dataset characterized by discrete values and irregular time stamps. They used a large private lung cancer pathways dataset consisting of sequences of events and their times, and model both the one-hot encoded sequence of event types and the event timings.<br />
<br />
Figure 2 shows t-SNE/PCA visualization comparison for Sines and Stocks and it is clear from the figure that among all different models, TimeGAN shows the best overlap between generated and original data.<br />
<br />
<div align="center"> [[File:pca.PNG]] </div><br />
<div align="center">'''Figure 2'''</div><br />
<br />
Table 2 shows a comparison of predictive and discriminative scores for different methods across different datasets. And TimeGAN outperforms other methods in both scores indicating a better quality of generated synthetic data across different types of datasets. <br />
<br />
<div align="center"> [[File:gtable2.PNG]] </div><br />
<div align="center">'''Table 2'''</div><br />
<br />
== Source Code ==<br />
<br />
The GitHub repository for the paper is https://github.com/jsyoon0823/TimeGAN .<br />
<br />
== Conclusion ==<br />
We could use traditional models like ARMA, ARIMA, and etc to analyze time-series type data. Also, the longitudinal data could be predicted by the generated additive model, linear mixed affected model to do both the feature selection and independent variable predicting. The author of this paper introduced an advanced architecture to generate new time-series data that combines the flexibility of unsupervised learning, and the control of the quality of the generated data by supervised learning.<br />
<br />
Combining the flexibility of GANs and control over conditional temporal dynamics of autoregressive models, TimeGAN shows significant quantitative and qualitative gains for generated time-series data across different varieties of datasets. <br />
<br />
The authors indicated the potential incorporation of Differential Privacy Frameworks into TimeGAN in the future in order to produce real-time sequences with differential privacy guarantees.<br />
<br />
== Critique ==<br />
The method introduced in this paper is truly a novel one. The idea of enhancing the unsupervised components of a GAN with some supervised element has shown significant jumps in certain evaluations. I think the methods of evaluation used in this paper namely, t-SNE/PCA analysis (visualization), discriminative score, and predictive score; are very appropriate for this sort of analysis where the focus is on multiple things (generative accuracy and conditional dependence) both quantitatively and qualitatively. Other related works <sup>[9]</sup> have also used the same evaluation setup.<br />
<br />
The idea of the synthesized time-series being useful in terms of its predictive ability is good, especially in practice. But I think when the authors set out to create a model that can learn the temporal dynamics between time-series data then there could have been some additional metric that could better evaluate if the underlying temporal relations have been captured by the model or not. I feel the addition of some form of temporal correlation analysis would have added to the completeness of the paper.<br />
<br />
The enhancement of traditional GAN by simply adding an extra loss function to the mix is quite elegant. TimeGAN uses a stepwise supervised loss. The authors have also used very common choices for the various components of the overall TimeGAN network. This leaves a lot of possibilities in this area as many direct and indirect variations of TimeGAN or other architectures inspired by TimeGAN can be developed in a very straightforward manner of hyper-parameterizing the building blocks. <br />
<br />
TimeGAN benefits from merging supervised and unsupervised learning to create their generations while other methods in the literature benefit from learning their conditional input to create its generations. I believe after even considering the supervised and unsupervised learning, the way that the authors introduced temporal embeddings to assist network training is not designed well for anomaly detection (outlier detection) as it is only designed for time series representation learning as discussed in [10].<br />
<br />
The paper certainly proposes a novel approach to analyzing time series data, but there are concerns about the way the model is tested in practice. First, if the data is generated from a <math>VAR(1)</math> model, why would the authors would not use a multi-dimensional auto-ARIMA procedure, or a Box-Jenkins approach, to fit a model to their synthetic dataset. Moreover, as has been studied in the M4 competitions (see e.g. https://www.sciencedirect.com/science/article/pii/S0169207019301128), the ability of complex ML models or deep learning models to beat linear models, in general, is questionable. The theoretical reason for this empirical finding is that the Wold decomposition theorem says that a stationary process can be decomposed into the sum of a deterministic process and linear process, which gives a lot of credence to the ARIMA model. It would be highly beneficial if the authors included the Box-Jenkins benchmark in their experiments as well as testing their model against real data to see if it actually performs well.<br />
<br />
== References ==<br />
<br />
[1] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.<br />
<br />
[2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.<br />
<br />
[3] Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609, 2016.<br />
<br />
[4] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.<br />
<br />
[5] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.<br />
<br />
[6] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.<br />
<br />
[7] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016<br />
<br />
[8] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018<br />
<br />
[9] Hao Ni, L. Szpruch, M. Wiese, S. Liao, Baoren Xiao. Conditional Sig-Wasserstein GANs for Time Series Generation, 2020<br />
<br />
[10] Geiger, Alexander et al. “TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks.” ArXiv abs/2009.07769 (2020): n. pag.<br />
<br />
[11] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.<br />
<br />
[12] Makridakis, Spyros, Evangelos Spiliotis, and Vassilios Assimakopoulos. "The M4 Competition: 100,000 time series and 61 forecasting methods." International Journal of Forecasting 36.1 (2020): 54-74.</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Time-series_Generative_Adversarial_Networks&diff=49840Time-series Generative Adversarial Networks2020-12-08T05:06:05Z<p>N2chatti: /* Introduction */</p>
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)<br />
<br />
== Introduction ==<br />
A time-series model should not only be good at learning the overall distribution of temporal features within different time points, but it should also be good at capturing the dynamic relationship between the temporal variables across time.<br />
<br />
The popular autoregressive approach in time-series or sequence analysis is generally focused on minimizing the error involved in multi-step sampling improving the temporal dynamics of data <sup>[1]</sup>. In this approach, the distribution of sequences is broken down into a product of conditional probabilities. The deterministic nature of this approach works well for forecasting but it is not very promising in a generative setup. The GAN approach when applied on time-series directly simply tries to learn <math>p(X|t)</math> using generator and discriminator setup but this fails to leverage the prior probabilities like in the case of the autoregressive models. GANs are being used to generate time-series data in different sectors such as healthcare, energy and finance. Researchers have conditioned them for various use cases, from handling irregular sampling [13] to building probabilistic forecasting models for univariate time-series. [14]<br />
<br />
This paper proposes a novel GAN architecture that combines the two approaches (unsupervised GANs and supervised autoregressive) that allow a generative model to have the ability to preserve temporal dynamics along with learning the overall distribution. This mechanism has been termed as '''Time-series Generative Adversarial Network''' or '''TimeGAN'''. To incorporate supervised learning of data into the GAN architecture, this approach makes use of an embedding network that provides a reversible mapping between the temporal features and their latent representations. The key insight of this paper is that the embedding network is trained in parallel with the generator/discriminator network.<br />
<br />
This approach leverages the flexibility of GANs along with the control of the autoregressive model resulting in significant improvements in the generation of realistic time-series.<br />
<br />
== Related Work ==<br />
The TimeGAN mechanism combines ideas from different research threads in time-series analysis.<br />
<br />
Due to differences between closed-loop training (ground truth conditioned) and open-loop inference (the previous guess conditioned), there can be significant prediction error in multi-step sampling in autoregressive recurrent networks <sup>[2]</sup>. Different methods have been proposed to remedy this including Scheduled Sampling <sup>[1]</sup>, based on curriculum learning <sup>[2]</sup>, where the models are trained to output based on a combination of ground truth and previous outputs. Another method inspired by adversarial domain adaptation is training an auxiliary discriminator that helps separate free-running and teacher-forced hidden states accelerating convergence<sup>[3][4]</sup>. Approach based on Actor-critic methods <sup>[5]</sup> have also been proposed that condition on target outputs estimating the next-token value that nudges the actor’s free-running predictions <sup>[11]</sup>. While all these proposed methods try to improve step-sampling, they are still inherently deterministic.<br />
<br />
Direct application of GAN architecture on time-series data like C-RNN-GAN or RCGAN <sup>[6]</sup> try to generate the time-series data recurrently sometimes taking the generated output from the previous step as input (like in case of RCGAN) along with the noise vector. Recently, adding time stamp information for conditioning has also been proposed in these setups to handle inconsistent sampling. But these approaches remain very GAN-centric and depend only on the traditional adversarial feedback (fake/real) to learn which is not sufficient to capture the temporal dynamics.<br />
<br />
== Problem Formulation ==<br />
Generally, time-series data can be decomposed into two components: static features (variables that remain constant over the entire time-series, or for a long period of time) and temporal features (variables that changes with respect to time). The paper uses <math>S</math> to denote the static component and <math>X</math> to denote the temporal features. Using this setting, inputs to the model can be thought of as a tuple of <math>(S, X_{1:t})</math> that has a joint distribution <math>p</math>. The objective of a generative model is to learn from training data, an approximation of the original distribution <math>p(S, X)</math> i.e. <math>\hat{p}(S, X)</math>. Along with this joint distribution, another objective is to simultaneously learn the autoregressive decomposition of <math>p(S, X_{1:T}) = p(S)\prod_tp(X_t|S, X_{1:t-1})</math> as well. This gives the following two objective functions.<br />
<br />
<div align="center"><math>min_\hat{p}D\left(p(S, X_{1:T})||\hat{p}(S, X_{1:T})\right)</math>, and </div><br />
<br />
<br />
<div align="center"><math>min_\hat{p}D\left(p(X_t | S, X_{1:t-1})||\hat{p}(X_t | S, X_{1:t-1})\right)</math></div><br />
<br />
== Proposed Architecture ==<br />
Apart from the normal GAN components of sequence generator and sequence discriminator, TimeGAN has two additional elements: an embedding function and a recovery function. As mentioned before, all these components are trained concurrently. Figure 1 shows how these four components are arranged and how the information flows between them during training in TimeGAN.<br />
<br />
<div align="center"> [[File:Architecture_TimeGAN.PNG|Architecture of TimeGAN.]] </div><br />
<div align="center">'''Figure 1'''</div><br />
<br />
=== Embedding and Recovery Functions ===<br />
These functions map between the temporal features and their latent representation. This mapping reduces the dimensionality of the original feature space. Let <math>H_s</math> and <math>H_x</math> denote the latent representations of <math>S</math> and <math>X</math> features in the original space. Therefore, the embedding function has the following form:<br />
<br />
<div align="center"> [[File:embedding_formula.PNG]] </div><br />
<br />
And similarly, the recovery function has the following form:<br />
<br />
<div align="center"> [[File:recovery_formula.PNG]] </div><br />
<br />
In the paper, these functions have been implemented using a recurrent network for '''e''' and a feedforward network for '''r'''. These implementation choices are of course subject to parametrization using any architecture.<br />
<br />
=== Sequence Generator and Discriminator ===<br />
Coming to the conventional GAN components of TimeGAN, there is a sequence generator and a sequence discriminator. But these do not work on the original space, rather the sequence generator uses the random input noise to generate sequences in the latent space. Thus, the generator takes as input the noise vectors <math>Z_s</math>, <math>Z_x</math> and turns them into a latent representation <math>H_s</math> and <math>H_x</math>. This function is implemented using a recurrent network. <br />
<br />
The discriminator takes as input the latent representation from the embedding space and produces its binary classification (synthetic/real). This is implemented using a bidirectional recurrent network with a feedforward output layer.<br />
<br />
=== Architecture Workflow ===<br />
The embedding and recovery functions ought to guarantee an accurate reversible mapping between the feature space and the latent space. After the embedding function turns the original data <math>(s, x_{1:t})</math> into the embedding space i.e. <math>h_s</math>, <math>h_x</math>, the recovery function should be able to reconstruct the original data as accurately as possible from this latent representation. Denoting the reconstructed data by <math>\tilde{s}</math> and <math>\tilde{x}_{1:t}</math>, we get the first objective function of the reconstruction loss:<br />
<br />
<div align="center"> [[File:recovery_loss.PNG]] </div><br />
<br />
The generator component in TimeGAN not only gets the noise vector Z as input but it also gets in autoregressive fashion, its previous output i.e. <math>h_s</math> and <math>h_{1:t}</math> as input as well. The generator uses these inputs to produce the synthetic embeddings. The unsupervised gradients when computed are used to decreasing the likelihood at the generator and increasing it at the discriminator to provide the correct classification of the produced synthetic output. This is the second objective function in the unsupervised loss form.<br />
<br />
<div align="center"> [[File:unsupervised_loss.PNG]] </div><br />
<br />
As mentioned before, TimeGAN does not rely on only the binary feedback from GANs adversarial component i.e. the discriminator. It also incorporates the supervised loss from the embedding and recovery functions into the fold. To ensure that the two segments of TimeGAN interact with each other, the generator is alternatively fed embeddings of actual data instead of its own previous synthetical produced embedding. Maximizing the likelihood of this produces the third objective i.e. the supervised loss:<br />
<br />
<div align="center"> [[File:supervised_loss.PNG]] </div><br />
<br />
=== Optimization ===<br />
The embedding and recovery components of TimeGAN are trained to minimize the Supervised loss and Recovery loss. If <math> \theta_{e} </math> and <math> \theta_{r} </math> denote their parameters, then the paper proposes the following as the optimization problem for these two components:<br />
Formula. <div align="center"> [[File:Paper27_eq1.PNG]] </div><br />
Here <math>\lambda</math> >= 0 is used to regularize (or balance) the two losses. <br />
The other components of TimeGAN i.e. generator and discriminator are trained to minimize the Supervised loss along with Unsupervised loss. This optimization problem is formulated as below:<br />
Formula. <div align="center"> [[File:Paper27_eq2.PNG]] </div> Here <math> \eta >= 0 </math> is used to regularize the two losses.<br />
<br />
== Experiments ==<br />
In the paper, the authors compare TimeGAN with the two most familiar and related variations of traditional GANs applied to time-series i.e. RCGAN and C-RNN-GAN. To make a comparison with autoregressive approaches, the authors use RNNs trained with T-Forcing and P-Forcing. Additionally, performance comparisons are also made with WaveNet <sup>[7]</sup> and its GAN alternative WaveGAN <sup>[8]</sup>. Qualitatively, the generated data is examined in terms of diversity (healthy distribution of sample covering real data), fidelity (samples should be indistinguishable from real data), and usefulness (samples should have the same predictive purposes as real data). The authors' results show that TimeGAN is insensitive to the two parameters <math>\lambda </math> and <math> \eta </math> so they decided to set <math> \lambda = 1 </math> and <math> \eta=10 </math> for all experiments. Compared to normal GAN, TimeGAN does not increase the difficulty in training.<br />
<br />
The following methods are used for benchmarking and evaluation:<br />
<br />
# '''Visualization''': This involves the application of t-SNE and PCA analysis on data (real and synthetic). This is done to compare the distribution of generated data with the real data in 2-D space.<br />
# '''Discriminative Score''': This involves training a post-hoc time-series classification model (an off-the-shelf RNN) to differentiate sequences from generated and original sets. <br />
# '''Predictive Score''': This involves training a post-hoc sequence prediction model to forecast using the generated data and this is evaluated against the real data.<br />
<br />
In the first experiment, the authors used time-series sequences from an autoregressive multivariate Gaussian data defined as <math>x_t=\phi x_{t-1}+n</math>, where <math>n \sim N(0, \sigma 1 + (1-\sigma)I)</math>. Table 1 has the results of this experiment performed by different models. The results clearly show how TimeGAN outperforms other methods in terms of both discriminative and predictive scores. <br />
<br />
<div align="center"> [[File:gtable1.PNG]] </div><br />
<div align="center">'''Table 1'''</div><br />
<br />
Next, the paper has experimented on different types of Time Series Data. Using time-series sequences of varying properties, the paper evaluates the performance of TimeGAN to testify for its ability to generalize over time-series data. The paper uses multiple datasets in different areas like Sines, Stocks, Energy, and Events with different methods(TimeGAN, RCGANM CRANNGANM, and etc), and visualize their performance between original data and synthetic. <br />
<br />
===Sines===<br />
They simulated multivariate sinusoidal sequences of different frequencies η and phases θ, providing continuous-valued, periodic, multivariate data where each feature is independent of others.<br />
<br />
===Stocks===<br />
By contrast, sequences of stock prices are continuous-valued but aperiodic; furthermore, features are correlated with each other. They use the daily historical Google stocks data from 2004 to 2019, including as features the volume and high, low, opening, closing, and adjusted closing prices.<br />
<br />
===Energy===<br />
They consider a dataset characterized by noisy periodicity, higher dimensionality, and correlated features. The UCI Appliances energy prediction dataset consists of multivariate, continuous-valued measurements including numerous temporal features measured at close intervals.<br />
<br />
===Events===<br />
Finally, they considered a dataset characterized by discrete values and irregular time stamps. They used a large private lung cancer pathways dataset consisting of sequences of events and their times, and model both the one-hot encoded sequence of event types and the event timings.<br />
<br />
Figure 2 shows t-SNE/PCA visualization comparison for Sines and Stocks and it is clear from the figure that among all different models, TimeGAN shows the best overlap between generated and original data.<br />
<br />
<div align="center"> [[File:pca.PNG]] </div><br />
<div align="center">'''Figure 2'''</div><br />
<br />
Table 2 shows a comparison of predictive and discriminative scores for different methods across different datasets. And TimeGAN outperforms other methods in both scores indicating a better quality of generated synthetic data across different types of datasets. <br />
<br />
<div align="center"> [[File:gtable2.PNG]] </div><br />
<div align="center">'''Table 2'''</div><br />
<br />
== Source Code ==<br />
<br />
The GitHub repository for the paper is https://github.com/jsyoon0823/TimeGAN .<br />
<br />
== Conclusion ==<br />
We could use traditional models like ARMA, ARIMA, and etc to analyze time-series type data. Also, the longitudinal data could be predicted by the generated additive model, linear mixed affected model to do both the feature selection and independent variable predicting. The author of this paper introduced an advanced architecture to generate new time-series data that combines the flexibility of unsupervised learning, and the control of the quality of the generated data by supervised learning.<br />
<br />
Combining the flexibility of GANs and control over conditional temporal dynamics of autoregressive models, TimeGAN shows significant quantitative and qualitative gains for generated time-series data across different varieties of datasets. <br />
<br />
The authors indicated the potential incorporation of Differential Privacy Frameworks into TimeGAN in the future in order to produce real-time sequences with differential privacy guarantees.<br />
<br />
== Critique ==<br />
The method introduced in this paper is truly a novel one. The idea of enhancing the unsupervised components of a GAN with some supervised element has shown significant jumps in certain evaluations. I think the methods of evaluation used in this paper namely, t-SNE/PCA analysis (visualization), discriminative score, and predictive score; are very appropriate for this sort of analysis where the focus is on multiple things (generative accuracy and conditional dependence) both quantitatively and qualitatively. Other related works <sup>[9]</sup> have also used the same evaluation setup.<br />
<br />
The idea of the synthesized time-series being useful in terms of its predictive ability is good, especially in practice. But I think when the authors set out to create a model that can learn the temporal dynamics between time-series data then there could have been some additional metric that could better evaluate if the underlying temporal relations have been captured by the model or not. I feel the addition of some form of temporal correlation analysis would have added to the completeness of the paper.<br />
<br />
The enhancement of traditional GAN by simply adding an extra loss function to the mix is quite elegant. TimeGAN uses a stepwise supervised loss. The authors have also used very common choices for the various components of the overall TimeGAN network. This leaves a lot of possibilities in this area as many direct and indirect variations of TimeGAN or other architectures inspired by TimeGAN can be developed in a very straightforward manner of hyper-parameterizing the building blocks. <br />
<br />
TimeGAN benefits from merging supervised and unsupervised learning to create their generations while other methods in the literature benefit from learning their conditional input to create its generations. I believe after even considering the supervised and unsupervised learning, the way that the authors introduced temporal embeddings to assist network training is not designed well for anomaly detection (outlier detection) as it is only designed for time series representation learning as discussed in [10].<br />
<br />
The paper certainly proposes a novel approach to analyzing time series data, but there are concerns about the way the model is tested in practice. First, if the data is generated from a <math>VAR(1)</math> model, why would the authors would not use a multi-dimensional auto-ARIMA procedure, or a Box-Jenkins approach, to fit a model to their synthetic dataset. Moreover, as has been studied in the M4 competitions (see e.g. https://www.sciencedirect.com/science/article/pii/S0169207019301128), the ability of complex ML models or deep learning models to beat linear models, in general, is questionable. The theoretical reason for this empirical finding is that the Wold decomposition theorem says that a stationary process can be decomposed into the sum of a deterministic process and linear process, which gives a lot of credence to the ARIMA model. It would be highly beneficial if the authors included the Box-Jenkins benchmark in their experiments as well as testing their model against real data to see if it actually performs well.<br />
<br />
== References ==<br />
<br />
[1] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.<br />
<br />
[2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.<br />
<br />
[3] Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609, 2016.<br />
<br />
[4] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.<br />
<br />
[5] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.<br />
<br />
[6] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.<br />
<br />
[7] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016<br />
<br />
[8] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018<br />
<br />
[9] Hao Ni, L. Szpruch, M. Wiese, S. Liao, Baoren Xiao. Conditional Sig-Wasserstein GANs for Time Series Generation, 2020<br />
<br />
[10] Geiger, Alexander et al. “TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks.” ArXiv abs/2009.07769 (2020): n. pag.<br />
<br />
[11] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.<br />
<br />
[12] Makridakis, Spyros, Evangelos Spiliotis, and Vassilios Assimakopoulos. "The M4 Competition: 100,000 time series and 61 forecasting methods." International Journal of Forecasting 36.1 (2020): 54-74.</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_Representations&diff=49839Self-Supervised Learning of Pretext-Invariant Representations2020-12-08T05:02:22Z<p>N2chatti: /* Introduction */</p>
<hr />
<div>==Authors==<br />
<br />
Ishan Misra, Laurens van der Maaten<br />
<br />
== Presented by == <br />
Sina Farsangi<br />
<br />
== Introduction == <br />
<br />
Modern image recognition and object detection systems find image representations by using a large number of data with pre-defined semantic annotations. Examples of these annotations include class labels [1] and bonding boxes [2], as shown in Figure 1. There is a need for a large number of labeled data which is not the case in all scenarios for finding representations using pre-defined semantic annotations. Also, these systems usually learn specific features for a particular type of class and not necessarily semantically meaningful features that can help generalize to other domains and classes. '''In other words, pre-defined semantic annotations scale poorly to the long tail of visual concepts'''[3]. Therefore, there has been a big interest in the community in finding image representations that are more visually meaningful and can help in several tasks such as image recognition and object detection. One of the fast-growing areas of research that tries to address this problem is '''self-supervised Learning'''. Self-Supervised Learning tries to learn deep models that find image representations from the pixels themselves rather than using pre-defined semantic annotated data. As we will show in self-supervised learning, there is no need for using human-provided class labels or bounding boxes for classification and object detection tasks, respectively. <br />
<br />
[[File: SSL_1.JPG | 800px | center]]<br />
<div align="center">'''Figure 1:''' Semantic Annotations used for finding image representations: a) Class labels and b) Bounding Boxes </div><br />
<br />
Self-Supervised Learning is often done using a set of tasks called '''Pretext tasks'''. During these tasks, a transformation <math> \tau </math> is applied to unlabeled images <math> I </math> to obtain a set of transformed images, <math> I^{t} </math>. Then, a deep neural network, <math> \phi(\theta) </math>, is trained to predict the transformation characteristic. Several Pretext Tasks exist based on the type of transformation used. Two of the most common pretext tasks used are rotations and jigsaw puzzle [4,5,6]. As shown in Figure 2, in the rotation task, unlabeled images, <math> </math> are rotated by random degrees (0,90,180,270) and the deep network learns to predict the rotation degree. The jigsaw task is more complicated than the rotation prediction task; first unlabeled images are cropped into 9 patches, then the image is perturbed by randomly permuting the nine patches. The unlabeled original image is referred to as the anchor data point (Figure 3-a), the reshuffled image that we get by permuting patches will be our positive sample (Figure 3-b) and the rest of the images in the dataset will be considered as negative samples. Each permutation falls into one of the 35 classes according to a formula given by the authors. A deep network is then trained to predict the class of the permutation of the patches in the perturbed image. Some other tasks include colorization where the model tries to revert the colors of a colored image turned to grayscale, and image reconstruction where a square chunk of the image is deleted and the model tries to reconstruct that part. <br />
<br />
[[File: SSL_2.JPG |1000px | center]]<br />
<div align="center">'''Figure 2:''' Self-Supervised Learning using Rotation and Jigsaw Pretext Tasks </div><br />
<br />
[[File:figure3.jpg |600px | center]]<br />
<div align="center">'''Figure 3:''' Jigsaw puzzle used as a pretext task in unsupervised representation learning. (a) Original image (b) augmented image </div><br />
Although the proposed pretext tasks have achieved promising results, they have the disadvantage of being covariant to the applied transformation. In other words, as deep networks are trained to predict transformation characteristics, they will also learn representations that will vary based on the applied transformation. By intuition, we would like to obtain representations that are common between the original images and the transformed ones. This idea is supported by the fact that humans can recognize these transformed images. This leads us to try to develop a method that obtains image representations that are common between the original and transformed images meaning the image representations are transformation invariant. The paper tries to address this problem by introducing '''Pretext Invariant Representation Learning''' (PIRL) that obtains Self-Supervised image representations as opposed to Pretext Tasks which are transformation invariant and therefore, more semantically meaningful. The performance of the proposed method is evaluated on several Self-Supervision learning benchmarks. The results show that the PIRL introduces a new state-of-the-art method in Self-Supervised Learning by learning transformation invariant representations.<br />
<br />
== Problem Formulation and Methodology ==<br />
<br />
[[File: SSL_3.JPG | 800px | center]]<br />
<div align="center">'''Figure 3:''' Overview of Standard Pretext Learning and Pretext-Invariant Representation Learning (PIRL). </div><br />
<br />
<br />
An overview of the proposed method and a comparison with Pretext Tasks are shown in Figure 3. For a given image , <math>I</math>, in the dataset of unlabeled images, <math> D=\{{I_1,I_2,...,I_{|D|}}\} </math>, a transformation <math> \tau </math> is applied: <br />
<br />
\begin{align} \tag{1} \label{eqn:1}<br />
I^t=\tau(I)<br />
\end{align}<br />
<br />
Where <math>I^t</math> is the transformed image. We would like to train a convolutional neural network, <math>\phi(\theta)</math>, that constructs image representations <math>v_{I}=\phi_{\theta}(I)</math>. Pretext Task based methods learn to predict transformation characteristics, <math>z(t)</math>, by minimizing a transformation covariant loss function in the form of:<br />
<br />
\begin{align} \tag{2} \label{eqn:2}<br />
l_{\text{cov}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,z(t))<br />
\end{align}<br />
<br />
As it can be seen, the loss function covaries with the applied transformation and therefore, the obtained representations may not be semantically meaningful. PIRL tries to solve for this problem as shown in Figure 3. The original and transformed images are passed through two parallel convolutional neural networks to obtain two sets of representations, <math>v(I)</math> and <math>v(I^t)</math>. Then, a contrastive loss function is defined to ensure that the representations of the original and transformed images are similar to each other. The transformation invariant loss function can be defined as:<br />
<br />
\begin{align} \tag{3} \label{eqn:3}<br />
l_{\text{inv}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,v_{I^t})<br />
\end{align}<br />
<br />
Where L is a contrastive loss based on Noise Contrastive Estimators (NCE). The NCE function can be shown as below: <br />
<br />
\begin{align} \tag{4} \label{eqn:4}<br />
h(v_I,v_{I^t})=\frac{\exp \biggl( \frac{s(v_I,v_{I^t})}{\tau} \biggr)}{\exp \biggl(\frac{s(v_I,v_{I^t})}{\tau} \biggr) + \sum_{I^{'} \in D_N}^{} \exp \biggl( \frac{s(v_{I^t},v_{I^{'}})}{\tau} \biggr)}<br />
\end{align}<br />
<br />
where <math>s(.,.)</math> is the cosine similarity function and <math>\tau</math> is the temperature parameter that is usually set to 0.07. Also, a set of N images are chosen randomly from the dataset where <math>I^{'}\neq I</math>. These images are used in the loss in order to ensure their representation dissimilarity with transformed image representations. Also, during model implementation, two heads (few additional deep layers), <math>f</math> and <math>g</math>, are applied on top of <math>v(I)</math> and <math>v(I^t)</math>. Using the NCE formulation, the contrastive loss can be written as:<br />
<br />
\begin{align} \tag{5} \label{eqn:5}<br />
L_{\text{NCE}}(I,I^{t})=-\text{log}[h(f(v_I),g(v_{I^t}))]-\sum_{I^{'}\in D_N}^{} \text{log}[1-h(g(v_{I^t}),f(v_{I^{'}}))]<br />
\end{align}<br />
<br />
[[File: SSL_4.JPG | 800px | center]]<br />
<div align="center">'''Figure 4:''' Proposed PIRL </div><br />
<br />
Although the formulation looks complicated, the take out here is that by minimizing the NCE based loss function, the similarity between the original and transformed image representations, <math>v(I)</math> and <math>v(I^t)</math>, increases and at the same time the dissimilarity between <math>v(I^t)</math> and negative images representations, <math>v(I^{'})</math>, is increased. According to the previous work, an infeasibly large batch size is needed to obtain a large number of negatives. To tackle this problem, a memory bank [9], <math>M</math>, is used during training which contains feature representation <math>m_I</math> for each image in the dataset including the negative images. The proposed PIRL model is shown in Figure 4. Finally, the contrastive loss in equation \eqref{eqn:5} does not take into account the dissimilarity between the original image representations, <math>v(I)</math>, and the negative image representations, <math>v(I^{'})</math>. By taking this into account and using the memory bank, the final contrastive loss function is obtained as:<br />
<br />
\begin{align} \tag{6} \label{eqn:6}<br />
L(I,I^{t})=\lambda L_{\text{NCE}}(m_I,g(v_{I^t})) + (1-\lambda)L_{\text{NCE}}(m_I,f(v_{I}))<br />
\end{align}<br />
where <math>\lambda</math> is a hyperparameter that determines the weight of each of NCE losses. The default value for this parameter is 0.5. In the next section, experimental results are shown using the proposed PIRL model.<br />
<br />
==Experimental Results ==<br />
<br />
For the experiments in this section, PIRL is implemented using jigsaw transformations. The combination of PIRL with other types of transformations is shown in the last section of the summary. The quality of image representations obtained from PIRL Self-Supervised Learning is evaluated by comparing its performance to other Self-Supervised Learning methods on image recognition and object detection tasks. For the experiments, a ResNet50 model is trained using PIRL and other methods by using 1.28M randomly sampled images from the ImageNet dataset. Also, the number of negative images used for PIRL is N=32000. <br />
<br />
===Object Detection===<br />
<br />
A Faster R-CNN model with a ResNet-50 backbone, pre-trained using PIRL and other Self-Supervised methods, is employed for the object detection task. Then, the pre-trained model weights are used as initial weights for the Faster-RCNN model backbone during training on the VOC07+12 dataset. The result of object detection using PIRL is shown in Figure (5) and it is compared to other methods. It can be seen that PIRL not only outperforms other Self-Supervised-based methods, '''for the first time it outperforms Supervised Pre-training on object detection'''. They emphasize that PIRL achieves this result using the same backbone model, the same number of finetuning epochs, and the exact same pre-training data (but without the labels). This result is a substantial improvement over prior self-supervised approaches that obtain worse performance than fully supervised baselines despite using orders of magnitude more curated training data or much larger backbone models. <br />
<br />
[[File: SSL_5.PNG | 800px | center]]<br />
<div align="center">'''Figure 5:''' Object detection on VOC07+12 using Faster R-CNN and comparing the Average Precision (AP) of detected bounding boxes. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
===Image Classification with linear models===<br />
<br />
In the next experiment, the performance of the PIRL is evaluated on image classification using four different datasets. For this experiment, the pre-trained ResNet-50 model is utilized as an image feature extractor. Then, a linear classifier is trained on fixed image representations. The results are shown in Figure (6). The results demonstrate that while PIRL substantially outperforms other Self-Supervised Learning methods, it still falls behind Supervised Pre-trained Learning. <br />
<br />
[[File: SSL_6.PNG | 800px | center]]<br />
<div align="center">'''Figure 6:''' Image classification with linear models. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
Overall, from Figure6, we can observe that PIRL has the best performance among different Self-Supervised Learning methods. Moreover, PIRL can even perform better than the Supervised Learning Pretrained model on object detection. This is because PIRL learns representations that are invariant to the applied transformations which results in more semantically meaningful and richer visual features. In the next section, some analysis on PIRL is presented.<br />
<br />
==Analysis==<br />
<br />
===Does PIRL learn invariant representations?===<br />
<br />
In order to show that the image representations obtained using PIRL are invariant, several images are chosen from the ImageNet dataset, and representations of the chosen images and their transformed version are obtained using one-time PIRL and another time the jigsaw pretext task which is the transformation covariant version of PIRL. Then, for each method, the L2 norm between the original and transformed image representations are computed and their distributions are plotted in Figure (7). It can be seen that PIRL results in more similarity between the original and transformed image representations. Therefore, PIRL learns invariant representations. <br />
<br />
[[File: SSL_7.PNG | 800px | center]]<br />
<div align="center">'''Figure 7:''' Invariance of PIRL representations. </div><br />
<br />
===Which layer produces the best representation?===<br />
Figure 12 studies the quality of representations in earlier layers of the convolutional networks. The figure reveals that the quality of Jigsaw representations improves from the conv1 to the res4 layer but that their quality sharply decreases in the res5 layer. By contrast, PIRL representations are invariant to image transformations and the best image representations are extracted from the res5 layer of PIRL-trained networks.<br />
<br />
[[File: Paper29_SSL.PNG | 800px | center]]<br />
<div align="center">'''Figure 12:'''Quality of PIRL representations per layer. </div><br />
<br />
===What is the effect of <math>\lambda</math> in the PIRL loss function?===<br />
<br />
In order to investigate the effect of <math>\lambda</math> on PIRL representations, the authors obtained the accuracy of image recognition on ImageNet dataset using different values for <math>\lambda</math> in PIRL. As shown in Figure 8, the results show that the value of <math>\lambda</math> affects the performance of PIRL and the optimum value for <math>\lambda</math> is 0.5. <br />
<br />
[[File: SSL_8.PNG | 800px | center]]<br />
<div align="center">'''Figure 8:''' Effect of varying the parameter <math>\lambda</math> </div><br />
<br />
===What is the effect of the number of images transforms?===<br />
<br />
As another experiment, the authors investigated the number of image transforms and their effect on PIRL performance. There is a limitation on the number of transformations that can be applied using the jigsaw pretext method as this method has to predict the permutation of the patches and the number of the parameters in the classification layer grows linearly with the number of used transformations. However, PIRL can use all number of image transformations which is equal to <math>9! \approx 3.6\times 10^5</math>. Figure (9) shows the effect of changing the number of patch permutations on PIRL and jigsaw. The results show that increasing the number of permutations increases the mean Average Precision (mAP) of PIRL on image classification using the VOCC07 dataset. <br />
<br />
[[File: SSL_9.PNG | 800px | center]]<br />
<div align="center">'''Figure 9:''' Effect of varying the number of patch permutations </div><br />
<br />
===What is the effect of the number of negative samples?===<br />
<br />
In order to investigate the effect of negative samples number, N, on PIRL's performance, the image classification accuracy is obtained using ImageNet dataset for a variety of values for N. As it is shown in Figure 10, increasing the number of negative sample results in richer image representations and higher classification accuracy. <br />
<br />
[[File: SSL_10.PNG | 800px | center]]<br />
<div align="center">'''Figure 10:''' Effect of varying the number of negative samples </div><br />
<br />
==Generalizing PIRL to Other Pretext Tasks==<br />
<br />
The used PIRL model in this paper used jigsaw permutations as the applied transformation to the original image. However, PIRL is generalizable to other Pretext Tasks. To show this, first, PIRL is used with rotation transformations and the performance of rotation-based PIRL is compared to the covariant rotation Pretext Task. The results in Figure (11) show that using PIRL substantially increases the classification accuracy on four datasets in comparison with the rotation Pretext Task. Next, both jigsaw and rotation transformations are used with PIRL to obtain image representations. The results show that combining multiple transformations with PIRL can further improve the accuracy of the image classification task. <br />
<br />
[[File: SSL_11.PNG | 800px | center]]<br />
<div align="center">'''Figure 11:''' Using PIRL with (combinations of) different pretext tasks </div><br />
<br />
==Conclusion==<br />
<br />
In this paper, a new state-of-the-art Self-Supervised learning method, PIRL, was presented. The proposed model learns to obtain features that are common between the original and transformed images, resulting in a set of transformation invariant and more semantically meaningful features. This is done by defining a contrastive loss function between the original images, transformed images, and a set of negative images. The results show that PIRL image representation is richer than previously proposed methods, resulting in higher accuracy and precision on image classification and object detection tasks.<br />
<br />
==Critiques==<br />
<br />
The paper proposes a very nice method for obtaining transformation invariant image representations. However, the authors can extend their work with a richer set of transformations. Also, it would be a good idea to investigate the combination of PIRL with clustering-based methods [7,8]. One of the clustering-based methods is '''DeepCluster''' [7], where each previous version of its representation is used by bootstrapping to produce a target for the next representation. They built a new representation through clustering data points using the prior representation and then classify the target by using the clustered index of each sample. That may result in better image representations. This will avoid the use of negative pairs but it might also cause collapsing to trivial solutions which create a trade-off. <br />
<br />
It could be better if they could visualize their network weights and compare them to the other supervised methods for the deeper layers that extract high-level information.<br />
<br />
== Source Code ==<br />
<br />
https://paperswithcode.com/paper/self-supervised-learning-of-pretext-invariant<br />
<br />
== References ==<br />
<br />
[1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.<br />
<br />
[2] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2015. <br />
<br />
[3] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint, 2017<br />
<br />
[4] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[5] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.<br />
<br />
[6] Jong-Chyi Su, Subhransu Maji, Bharath Hariharan. When does self-supervision improve few-shot learning? European Conference on Computer Vision, 2020.<br />
<br />
[7] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.<br />
<br />
[8] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.<br />
<br />
[9] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:figure3.jpg&diff=49838File:figure3.jpg2020-12-08T05:00:52Z<p>N2chatti: </p>
<hr />
<div></div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_Representations&diff=49837Self-Supervised Learning of Pretext-Invariant Representations2020-12-08T04:59:25Z<p>N2chatti: /* Introduction */</p>
<hr />
<div>==Authors==<br />
<br />
Ishan Misra, Laurens van der Maaten<br />
<br />
== Presented by == <br />
Sina Farsangi<br />
<br />
== Introduction == <br />
<br />
Modern image recognition and object detection systems find image representations by using a large number of data with pre-defined semantic annotations. Examples of these annotations include class labels [1] and bonding boxes [2], as shown in Figure 1. There is a need for a large number of labeled data which is not the case in all scenarios for finding representations using pre-defined semantic annotations. Also, these systems usually learn specific features for a particular type of class and not necessarily semantically meaningful features that can help generalize to other domains and classes. '''In other words, pre-defined semantic annotations scale poorly to the long tail of visual concepts'''[3]. Therefore, there has been a big interest in the community in finding image representations that are more visually meaningful and can help in several tasks such as image recognition and object detection. One of the fast-growing areas of research that tries to address this problem is '''self-supervised Learning'''. Self-Supervised Learning tries to learn deep models that find image representations from the pixels themselves rather than using pre-defined semantic annotated data. As we will show in self-supervised learning, there is no need for using human-provided class labels or bounding boxes for classification and object detection tasks, respectively. <br />
<br />
[[File: SSL_1.JPG | 800px | center]]<br />
<div align="center">'''Figure 1:''' Semantic Annotations used for finding image representations: a) Class labels and b) Bounding Boxes </div><br />
<br />
Self-Supervised Learning is often done using a set of tasks called '''Pretext tasks'''. During these tasks, a transformation <math> \tau </math> is applied to unlabeled images <math> I </math> to obtain a set of transformed images, <math> I^{t} </math>. Then, a deep neural network, <math> \phi(\theta) </math>, is trained to predict the transformation characteristic. Several Pretext Tasks exist based on the type of transformation used. Two of the most common pretext tasks used are rotations and jigsaw puzzle [4,5,6]. As shown in Figure 2, in the rotation task, unlabeled images, <math> </math> are rotated by random degrees (0,90,180,270) and the deep network learns to predict the rotation degree. The jigsaw task is more complicated than the rotation prediction task; first unlabeled images are cropped into 9 patches, then the image is perturbed by randomly permuting the nine patches. The unlabeled original image is referred to as the anchor data point (Figure 3-a), the reshuffled image that we get by permuting patches will be our positive sample (Figure 3-b) and the rest of the images in the dataset will be considered as negative samples. Each permutation falls into one of the 35 classes according to a formula given by the authors. A deep network is then trained to predict the class of the permutation of the patches in the perturbed image. Some other tasks include colorization where the model tries to revert the colors of a colored image turned to grayscale, and image reconstruction where a square chunk of the image is deleted and the model tries to reconstruct that part. <br />
<br />
[[File: SSL_2.JPG |1000px | center]]<br />
<div align="center">'''Figure 2:''' Self-Supervised Learning using Rotation and Jigsaw Pretext Tasks </div><br />
<br />
[[File:figure3.jpg |1000px | center]]<br />
<div align="center">'''Figure 3:''' Jigsaw puzzle used as a pretext task in unsupervised representation learning. (a) Original image (b) augmented image </div><br />
Although the proposed pretext tasks have achieved promising results, they have the disadvantage of being covariant to the applied transformation. In other words, as deep networks are trained to predict transformation characteristics, they will also learn representations that will vary based on the applied transformation. By intuition, we would like to obtain representations that are common between the original images and the transformed ones. This idea is supported by the fact that humans can recognize these transformed images. This leads us to try to develop a method that obtains image representations that are common between the original and transformed images meaning the image representations are transformation invariant. The paper tries to address this problem by introducing '''Pretext Invariant Representation Learning''' (PIRL) that obtains Self-Supervised image representations as opposed to Pretext Tasks which are transformation invariant and therefore, more semantically meaningful. The performance of the proposed method is evaluated on several Self-Supervision learning benchmarks. The results show that the PIRL introduces a new state-of-the-art method in Self-Supervised Learning by learning transformation invariant representations.<br />
<br />
== Problem Formulation and Methodology ==<br />
<br />
[[File: SSL_3.JPG | 800px | center]]<br />
<div align="center">'''Figure 3:''' Overview of Standard Pretext Learning and Pretext-Invariant Representation Learning (PIRL). </div><br />
<br />
<br />
An overview of the proposed method and a comparison with Pretext Tasks are shown in Figure 3. For a given image , <math>I</math>, in the dataset of unlabeled images, <math> D=\{{I_1,I_2,...,I_{|D|}}\} </math>, a transformation <math> \tau </math> is applied: <br />
<br />
\begin{align} \tag{1} \label{eqn:1}<br />
I^t=\tau(I)<br />
\end{align}<br />
<br />
Where <math>I^t</math> is the transformed image. We would like to train a convolutional neural network, <math>\phi(\theta)</math>, that constructs image representations <math>v_{I}=\phi_{\theta}(I)</math>. Pretext Task based methods learn to predict transformation characteristics, <math>z(t)</math>, by minimizing a transformation covariant loss function in the form of:<br />
<br />
\begin{align} \tag{2} \label{eqn:2}<br />
l_{\text{cov}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,z(t))<br />
\end{align}<br />
<br />
As it can be seen, the loss function covaries with the applied transformation and therefore, the obtained representations may not be semantically meaningful. PIRL tries to solve for this problem as shown in Figure 3. The original and transformed images are passed through two parallel convolutional neural networks to obtain two sets of representations, <math>v(I)</math> and <math>v(I^t)</math>. Then, a contrastive loss function is defined to ensure that the representations of the original and transformed images are similar to each other. The transformation invariant loss function can be defined as:<br />
<br />
\begin{align} \tag{3} \label{eqn:3}<br />
l_{\text{inv}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,v_{I^t})<br />
\end{align}<br />
<br />
Where L is a contrastive loss based on Noise Contrastive Estimators (NCE). The NCE function can be shown as below: <br />
<br />
\begin{align} \tag{4} \label{eqn:4}<br />
h(v_I,v_{I^t})=\frac{\exp \biggl( \frac{s(v_I,v_{I^t})}{\tau} \biggr)}{\exp \biggl(\frac{s(v_I,v_{I^t})}{\tau} \biggr) + \sum_{I^{'} \in D_N}^{} \exp \biggl( \frac{s(v_{I^t},v_{I^{'}})}{\tau} \biggr)}<br />
\end{align}<br />
<br />
where <math>s(.,.)</math> is the cosine similarity function and <math>\tau</math> is the temperature parameter that is usually set to 0.07. Also, a set of N images are chosen randomly from the dataset where <math>I^{'}\neq I</math>. These images are used in the loss in order to ensure their representation dissimilarity with transformed image representations. Also, during model implementation, two heads (few additional deep layers), <math>f</math> and <math>g</math>, are applied on top of <math>v(I)</math> and <math>v(I^t)</math>. Using the NCE formulation, the contrastive loss can be written as:<br />
<br />
\begin{align} \tag{5} \label{eqn:5}<br />
L_{\text{NCE}}(I,I^{t})=-\text{log}[h(f(v_I),g(v_{I^t}))]-\sum_{I^{'}\in D_N}^{} \text{log}[1-h(g(v_{I^t}),f(v_{I^{'}}))]<br />
\end{align}<br />
<br />
[[File: SSL_4.JPG | 800px | center]]<br />
<div align="center">'''Figure 4:''' Proposed PIRL </div><br />
<br />
Although the formulation looks complicated, the take out here is that by minimizing the NCE based loss function, the similarity between the original and transformed image representations, <math>v(I)</math> and <math>v(I^t)</math>, increases and at the same time the dissimilarity between <math>v(I^t)</math> and negative images representations, <math>v(I^{'})</math>, is increased. According to the previous work, an infeasibly large batch size is needed to obtain a large number of negatives. To tackle this problem, a memory bank [9], <math>M</math>, is used during training which contains feature representation <math>m_I</math> for each image in the dataset including the negative images. The proposed PIRL model is shown in Figure 4. Finally, the contrastive loss in equation \eqref{eqn:5} does not take into account the dissimilarity between the original image representations, <math>v(I)</math>, and the negative image representations, <math>v(I^{'})</math>. By taking this into account and using the memory bank, the final contrastive loss function is obtained as:<br />
<br />
\begin{align} \tag{6} \label{eqn:6}<br />
L(I,I^{t})=\lambda L_{\text{NCE}}(m_I,g(v_{I^t})) + (1-\lambda)L_{\text{NCE}}(m_I,f(v_{I}))<br />
\end{align}<br />
where <math>\lambda</math> is a hyperparameter that determines the weight of each of NCE losses. The default value for this parameter is 0.5. In the next section, experimental results are shown using the proposed PIRL model.<br />
<br />
==Experimental Results ==<br />
<br />
For the experiments in this section, PIRL is implemented using jigsaw transformations. The combination of PIRL with other types of transformations is shown in the last section of the summary. The quality of image representations obtained from PIRL Self-Supervised Learning is evaluated by comparing its performance to other Self-Supervised Learning methods on image recognition and object detection tasks. For the experiments, a ResNet50 model is trained using PIRL and other methods by using 1.28M randomly sampled images from the ImageNet dataset. Also, the number of negative images used for PIRL is N=32000. <br />
<br />
===Object Detection===<br />
<br />
A Faster R-CNN model with a ResNet-50 backbone, pre-trained using PIRL and other Self-Supervised methods, is employed for the object detection task. Then, the pre-trained model weights are used as initial weights for the Faster-RCNN model backbone during training on the VOC07+12 dataset. The result of object detection using PIRL is shown in Figure (5) and it is compared to other methods. It can be seen that PIRL not only outperforms other Self-Supervised-based methods, '''for the first time it outperforms Supervised Pre-training on object detection'''. They emphasize that PIRL achieves this result using the same backbone model, the same number of finetuning epochs, and the exact same pre-training data (but without the labels). This result is a substantial improvement over prior self-supervised approaches that obtain worse performance than fully supervised baselines despite using orders of magnitude more curated training data or much larger backbone models. <br />
<br />
[[File: SSL_5.PNG | 800px | center]]<br />
<div align="center">'''Figure 5:''' Object detection on VOC07+12 using Faster R-CNN and comparing the Average Precision (AP) of detected bounding boxes. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
===Image Classification with linear models===<br />
<br />
In the next experiment, the performance of the PIRL is evaluated on image classification using four different datasets. For this experiment, the pre-trained ResNet-50 model is utilized as an image feature extractor. Then, a linear classifier is trained on fixed image representations. The results are shown in Figure (6). The results demonstrate that while PIRL substantially outperforms other Self-Supervised Learning methods, it still falls behind Supervised Pre-trained Learning. <br />
<br />
[[File: SSL_6.PNG | 800px | center]]<br />
<div align="center">'''Figure 6:''' Image classification with linear models. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
Overall, from Figure6, we can observe that PIRL has the best performance among different Self-Supervised Learning methods. Moreover, PIRL can even perform better than the Supervised Learning Pretrained model on object detection. This is because PIRL learns representations that are invariant to the applied transformations which results in more semantically meaningful and richer visual features. In the next section, some analysis on PIRL is presented.<br />
<br />
==Analysis==<br />
<br />
===Does PIRL learn invariant representations?===<br />
<br />
In order to show that the image representations obtained using PIRL are invariant, several images are chosen from the ImageNet dataset, and representations of the chosen images and their transformed version are obtained using one-time PIRL and another time the jigsaw pretext task which is the transformation covariant version of PIRL. Then, for each method, the L2 norm between the original and transformed image representations are computed and their distributions are plotted in Figure (7). It can be seen that PIRL results in more similarity between the original and transformed image representations. Therefore, PIRL learns invariant representations. <br />
<br />
[[File: SSL_7.PNG | 800px | center]]<br />
<div align="center">'''Figure 7:''' Invariance of PIRL representations. </div><br />
<br />
===Which layer produces the best representation?===<br />
Figure 12 studies the quality of representations in earlier layers of the convolutional networks. The figure reveals that the quality of Jigsaw representations improves from the conv1 to the res4 layer but that their quality sharply decreases in the res5 layer. By contrast, PIRL representations are invariant to image transformations and the best image representations are extracted from the res5 layer of PIRL-trained networks.<br />
<br />
[[File: Paper29_SSL.PNG | 800px | center]]<br />
<div align="center">'''Figure 12:'''Quality of PIRL representations per layer. </div><br />
<br />
===What is the effect of <math>\lambda</math> in the PIRL loss function?===<br />
<br />
In order to investigate the effect of <math>\lambda</math> on PIRL representations, the authors obtained the accuracy of image recognition on ImageNet dataset using different values for <math>\lambda</math> in PIRL. As shown in Figure 8, the results show that the value of <math>\lambda</math> affects the performance of PIRL and the optimum value for <math>\lambda</math> is 0.5. <br />
<br />
[[File: SSL_8.PNG | 800px | center]]<br />
<div align="center">'''Figure 8:''' Effect of varying the parameter <math>\lambda</math> </div><br />
<br />
===What is the effect of the number of images transforms?===<br />
<br />
As another experiment, the authors investigated the number of image transforms and their effect on PIRL performance. There is a limitation on the number of transformations that can be applied using the jigsaw pretext method as this method has to predict the permutation of the patches and the number of the parameters in the classification layer grows linearly with the number of used transformations. However, PIRL can use all number of image transformations which is equal to <math>9! \approx 3.6\times 10^5</math>. Figure (9) shows the effect of changing the number of patch permutations on PIRL and jigsaw. The results show that increasing the number of permutations increases the mean Average Precision (mAP) of PIRL on image classification using the VOCC07 dataset. <br />
<br />
[[File: SSL_9.PNG | 800px | center]]<br />
<div align="center">'''Figure 9:''' Effect of varying the number of patch permutations </div><br />
<br />
===What is the effect of the number of negative samples?===<br />
<br />
In order to investigate the effect of negative samples number, N, on PIRL's performance, the image classification accuracy is obtained using ImageNet dataset for a variety of values for N. As it is shown in Figure 10, increasing the number of negative sample results in richer image representations and higher classification accuracy. <br />
<br />
[[File: SSL_10.PNG | 800px | center]]<br />
<div align="center">'''Figure 10:''' Effect of varying the number of negative samples </div><br />
<br />
==Generalizing PIRL to Other Pretext Tasks==<br />
<br />
The used PIRL model in this paper used jigsaw permutations as the applied transformation to the original image. However, PIRL is generalizable to other Pretext Tasks. To show this, first, PIRL is used with rotation transformations and the performance of rotation-based PIRL is compared to the covariant rotation Pretext Task. The results in Figure (11) show that using PIRL substantially increases the classification accuracy on four datasets in comparison with the rotation Pretext Task. Next, both jigsaw and rotation transformations are used with PIRL to obtain image representations. The results show that combining multiple transformations with PIRL can further improve the accuracy of the image classification task. <br />
<br />
[[File: SSL_11.PNG | 800px | center]]<br />
<div align="center">'''Figure 11:''' Using PIRL with (combinations of) different pretext tasks </div><br />
<br />
==Conclusion==<br />
<br />
In this paper, a new state-of-the-art Self-Supervised learning method, PIRL, was presented. The proposed model learns to obtain features that are common between the original and transformed images, resulting in a set of transformation invariant and more semantically meaningful features. This is done by defining a contrastive loss function between the original images, transformed images, and a set of negative images. The results show that PIRL image representation is richer than previously proposed methods, resulting in higher accuracy and precision on image classification and object detection tasks.<br />
<br />
==Critiques==<br />
<br />
The paper proposes a very nice method for obtaining transformation invariant image representations. However, the authors can extend their work with a richer set of transformations. Also, it would be a good idea to investigate the combination of PIRL with clustering-based methods [7,8]. One of the clustering-based methods is '''DeepCluster''' [7], where each previous version of its representation is used by bootstrapping to produce a target for the next representation. They built a new representation through clustering data points using the prior representation and then classify the target by using the clustered index of each sample. That may result in better image representations. This will avoid the use of negative pairs but it might also cause collapsing to trivial solutions which create a trade-off. <br />
<br />
It could be better if they could visualize their network weights and compare them to the other supervised methods for the deeper layers that extract high-level information.<br />
<br />
== Source Code ==<br />
<br />
https://paperswithcode.com/paper/self-supervised-learning-of-pretext-invariant<br />
<br />
== References ==<br />
<br />
[1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.<br />
<br />
[2] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2015. <br />
<br />
[3] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint, 2017<br />
<br />
[4] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[5] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.<br />
<br />
[6] Jong-Chyi Su, Subhransu Maji, Bharath Hariharan. When does self-supervision improve few-shot learning? European Conference on Computer Vision, 2020.<br />
<br />
[7] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.<br />
<br />
[8] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.<br />
<br />
[9] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval&diff=49836Pre-Training Tasks For Embedding-Based Large-Scale Retrieval2020-12-08T04:54:52Z<p>N2chatti: /* Critique */</p>
<hr />
<div><br />
==Contributors==<br />
Pierre McWhannel wrote this summary with editorial and technical contributions from STAT 946 fall 2020 classmates. The summary is based on the paper "Pre-Training Tasks for Embedding-Based Large-Scale Retrieval" which was presented at ICLR 2020. The authors of this paper are Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar [1].<br />
<br />
==Introduction==<br />
The problem domain in which the paper is positioned is large-scale query-document retrieval. This problem is: given a query/question to collect relevant documents which contain the answer(s) and identify the span of words which contain the answer(s). This problem is often separated into two steps: 1) a retrieval phase which reduces a large corpus to a much smaller subset. Variants of the classical probabilistic retrieval model BM25 are usually used during this phase to rapidly filter a set of documents. BM25 is considered as one of strongest baseline algorithms for IR and is known for being highly effective 2) a reader which is applied to read the shortened list of documents and score them based on their ability to answer the query and identify the span containing the answer, these spans can be ranked according to some scoring function. In the setting of an open-domain question answering, the query, <math>q</math>, is a question and the documents <math>d</math> represent a passage which may contain the answer(s). The focus of this paper is on the retriever question answering (QA), in this setting a scoring function <math>f: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R} </math> is utilized to map <math>(q,d) \in \mathcal{X} \times \mathcal{Y}</math> to a real value which represents the score of how relevant the passage is to the query. We desire high scores for relevant passages and low scores otherwise and this is the job of the retriever. <br />
<br />
<br />
Certain characteristics desired of the retriever are high recall since the reader will only be applied to a subset meaning there is an opportunity to identify false positives but not false negatives. The other characteristic desired of a retriever is low latency meaning it is computationally efficient to be used for real-world applications. The popular approach to meet these heuristics has been the BM-25 [2] which relies on token-based matching between two high-dimensional and sparse vectors representing <math>q</math> and <math>d</math>. This approach performs well and retrieval can be performed in time sublinear to the number of passages with inverter indexing. The downfall of this type of algorithm is its inability to be optimized for a specific task. An alternate option is BERT [3] or transformer-based models [4] with cross attention between query and passage pairs which can be optimized for a specific task. For example, BERT assigned both masked language model and Next Sentence Prediction, pre-trained on various open resources data(BookCorpus + English WIKIPEDIA, CC NEWS, OpenWebtex, and Stories) with total of 160 GB. However, these models suffer from latency as the retriever needs to process all the pairwise combinations of a query with all passages. The next type of algorithm is an embedding-based model that jointly embeds the query and passage in the same embedding space and then uses measures such as cosine distance, inner product, or even the Euclidean distance in this space to get a score. The authors suggest the two-tower models can capture deeper semantic relationships in addition to being able to be optimized for specific tasks. This model is referred to as the "two-tower retrieval model", since it has two transformer-based models where one embeds the query <math>\phi{(\cdot)}</math> and the other the passage <math>\psi{(\cdot)}</math>, then the embeddings can be scored by <math>f(q,d) = \langle \phi{(q)},\psi{(d)} \rangle \in \mathbb{R} </math>. This model can avoid the latency problems of a cross-layer model by pre-computing <math>\psi{(d)}</math> beforehand and then by utilizing efficient approximate nearest neighbor search algorithms in the embedding space to find the nearest documents.<br />
<br />
<br />
These two-tower retrieval models often use pre-trained weights from BERT. In BERT pre-trained model, the weight is obtained by the pre-training tasks are masked-LM (MLM) and next sentence prediction (NSP). The authors note that the research of pre-training tasks that improve the performance of two-tower retrieval models is an unsolved research problem. This is exactly what the authors have done in this paper. That is, to develop pre-training tasks for the two-tower model based on heuristics and then validated by experimental results. The pre-training tasks suggested are Inverse Cloze Task (ICT), Body First Search (BFS), Wiki Link Prediction (WLP), and combining all three. The authors used the Retrieval Question-Answering ('''ReQA''') benchmark [5] and used two datasets '''SQuAD'''[6] and '''Natural Questions'''[7] for training and evaluating their models.<br />
<br />
<br />
<br />
===Contributions of this paper===<br />
*The two-tower transformer model with proper pre-training can significantly outperform the widely used BM-25 Algorithm.<br />
*Paragraph-level pre-training tasks such as ICT, BFS, and WLP hugely improve the retrieval quality, whereas the most widely used pre-training task (the token-level masked-LM) gives only marginal gains.<br />
*The two-tower models with deep transformer encoders benefit more from paragraph-level pre-training compared to its shallow bag-of-word counterpart (BoW-MLP)<br />
<br />
==Background on Two-Tower Retrieval Models==<br />
This section is to present the two-tower model in more detail. As seen previously <math>q \in \mathcal{X}</math>, <math> d \in \mathcal{y}</math> are the query and document or passage respectively. The two-tower retrieval model consists of two encoders: <math>\phi: \mathcal{X} \rightarrow \mathbb{R}^k</math> and <math>\psi: \mathcal{Y} \rightarrow \mathbb{R}^k</math>. The tokens in <math>\mathcal{X},\mathcal{Y}</math> are a sequence of tokens representing the query and document respectively. The scoring function is <math> f(q,d) = \langle \phi{(q)},\psi{(d)} \rangle \in \mathbb{R} </math>. The cross-attention BERT-style model would have <math> f_{\theta,w}(q,d) = \psi_{\theta}{(q \oplus d)^{T}}w </math> as a scoring function and <math> \oplus </math> signifies the concatenation of the sequences of <math>q</math> and <math>d</math>, <math> \theta </math> are the parameters of the cross-attention model. The architectures of these two models can be seen below in figure 1.<br />
<br />
<br/><br />
[[File:two-tower-vs-cross-layer.PNG|center|Figure 1: end-to-end model]]<br />
<br/><br />
<br />
===Comparisons===<br />
<br />
====Inference====<br />
In terms of inference, two-tower architecture has a distinct advantage as being easily seen from the architecture. Since the query and document are embedded independently, <math> \psi{(d)}</math> can be pre-computed. This is an important advantage as it makes the two-tower method feasible. As an example, the authors state the inner product between 100 query embeddings and 1 million document embeddings only takes hundreds of milliseconds on CPUs, however, for the equivalent calculation for a cross-attention-model it takes hours on GPUs. In addition, they remark the retrieval of the relevant documents can be done in sublinear time with respect <math> |\mathcal{Y}| </math> using a maximum inner product (MIPS) algorithm with little loss in the recall.<br />
<br />
====Learning====<br />
In terms of learning the two-tower model compared to BM-25 has a unique advantage of being possible to be fine-tuned for downstream tasks. This problem is within the paradigm of metric learning as the scoring function can be passed as input to an objective function of the log likelihood <math> \max_{\theta} \space log(p_{\theta}(d|q)) </math>. The conditional probability is represented by a SoftMax and then can be rewritten as:<br />
<br />
<br />
<math> p_{\theta}(d|q) = \frac{exp(f_{\theta}(q,d))}{\sum_{d' \in D} exp(f_{\theta}(q,d'))} \tag{1} \label{eq:sampled_SoftMax}</math><br />
<br />
<br />
Here <math> \mathcal{D} </math> is the set of all possible documents, the denominator is a summation over <math>\mathcal{D} </math>. As is customary, they utilize a sampled SoftMax technique to approximate the full-SoftMax. The technique they have employed is to use a small subset of documents in the current training batch, while also using a proper correcting term to ensure the unbiasedness of the partition function.<br />
<br />
====Why do we need pre-training then?====<br />
Although the two-tower model can be fine-tuned for a downstream task getting sufficiently labeled data is not always possible. This is an opportunity to look for better pre-training tasks as the authors have done. The authors suggest getting a set of pre-training task <math> \mathcal{T} </math> with labeled positive pairs of queries of documents/passages and then afterward fine-tuning on the limited amount of data for the downstream task.<br />
<br />
==Pre-Training Tasks==<br />
The authors suggest three pre-training tasks for the encoders of the two-tower retriever, these are ICT, BFS, and WLP. Keep in mind that ICT [8] is not a newly proposed task. These pre-training tasks are developed in accordance with two heuristics they state and our paragraph-level pre-training tasks in contrast to more common sentence-level or word-level tasks:<br />
<br />
# The tasks should capture different resolutions of semantics between query and document. The semantics can be the local context within a paragraph, global consistency within a document, and even semantic relation between two documents.<br />
# Collecting the data for the pre-training tasks should require as little human intervention as possible and be cost-efficient.<br />
<br />
All three of the following tasks can be constructed in a cost-efficient manner and with minimal human intervention by utilizing Wikipedia articles. In figure 2 below the dashed line surrounding the block of text is to be regarded as the document <math> d </math> and the rest of the circled portions of text make up the three queries <math>q_1, q_2, q_3 </math> selected for each task.<br />
<br />
<br/><br />
[[File:pre-train-tasks.PNG|center|Figure 1: end-to-end model]]<br />
<br/><br />
<br />
===Inverse Cloze Task (ICT)===<br />
This task is given a passage <math> p </math> consisting of <math> n </math> sentences <math>s_{i}, i=1,..,n </math>. <math> s_{i} </math> is randomly sampled and used as the query <math> q </math> and then the document is constructed to be based on the remaining <math> n-1 </math> sentences. This is utilized to meet the local context within a paragraph of the first heuristic. In figure 2 <math> q_1 </math> is an example of a potential query for ICT based on the selected document <math> d </math>.<br />
<br />
===Body First Search (BFS)===<br />
BFS is <math> (q,d) </math> pairs are created by randomly selecting a sentence <math> q_2 </math> and setting it as the query from the first section of a Wikipedia page which contains the selected document <math> d </math>. This is utilized to meet the global context consistency within a document as the first section in Wikipedia generally is an overview of the whole page and they anticipate it to contain information central to the topic. In figure 2 <math> q_2 </math> is an example of a potential query for BFS based on the selected document <math> d </math>.<br />
<br />
===Wiki Link Prediction (WLP)===<br />
The third task is selecting a hyperlink that is within the selected document <math> d </math>, as can be observed in figure 2 "machine learning" is a valid option. The query <math> q_3 </math> will then be a random sentence on the first section of the page where the hyperlinks redirects to. This is utilized to provide the semantic relation between documents.<br />
<br />
<br />
Additionally, Masked LM (MLM) is considered during the experimentations which again is one of the primary pre-training tasks used in BERT.<br />
<br />
==Experiments and Results==<br />
<br />
===Training Specifications===<br />
<br />
*Each tower is constructed from the 12 layers BERT-base model.<br />
*Embedding is achieved by applying a linear layer on the [CLS] token output to get an embedding dimension of 512.<br />
*Sequence lengths of 64 and 288 respectively for the query and document encoders.<br />
*Pre-trained on 32 TPU v3 chips for 100k step with Adam optimizer learning rate of <math> 1 \times 10^{-4} </math> with 0.1 warm-up ratio, followed by linear learning rate decay and batch size of 8192. (~2.5 days to complete)<br />
*Fine tuning on downstream task <math> 5 \times 10^{-5} </math> with 2000 training steps and batch size 512.<br />
*Tokenizer is WordPiece.<br />
<br />
===Pre-Training Tasks Setup===<br />
<br />
*MLM, ICT, BFS, and WLP are used.<br />
*Various combinations of the aforementioned tasks are tested.<br />
*Tasks define <math> (q,d) </math> pairings.<br />
*ICT's document <math> d </math> has the article title and passage separated by a [SEP] symbol as input to the document encoder.<br />
<br />
Below table 1 shows some statistics for the datasets constructed for the ICT, BFS, and WLP tasks. Token counts considered after the tokenizer is applied.<br />
<br />
[[File: wiki-pre-train-statistics.PNG | center]]<br />
<br />
===Downstream Tasks Setup===<br />
Retrieval Question-Answering ('''ReQA''') benchmark used.<br />
<br />
*Datasets: '''SQuAD''' and '''Natural Questions'''.<br />
*Each entry of data from datasets is <math> (q,a,p) </math>, where each element is, query <math> q </math>, answer <math> a </math>, and passage <math> p </math> containing the answer, respectively.<br />
*Authors split the passage to sentences <math>s_i</math> where <math> i=1,...n </math> and <math> n </math> is the number of sentences in the passage.<br />
*The problem is then recast for a given query to retrieve the correct sentence-passage <math> (s,p) </math> from all candidates for all passages split in this fashion.<br />
*They remark their reformulation makes it a more difficult problem.<br />
<br />
Note: Experiments done on '''ReQA''' benchmarks are not entirely open-domain QA retrieval as the candidate <math> (s,p) </math> only cover the training set of QA dataset instead of entire Wikipedia articles. There are some experiments they did with augmented the dataset as will be seen in table 6.<br />
<br />
Below table 2 shows statistics of the datasets used for the downstream QA task.<br />
<br />
<br />
[[File:data-statistics-30.PNG | center]]<br />
<br />
===Evaluation===<br />
<br />
*3 train/test splits: 1%/99%, 5%/95%, and 80%/20%.<br />
*10% of the training set is used as a validation set for hyper-parameter tuning.<br />
*Training never has seen the test queries in the test pairs of <math>(q,d).</math><br />
*Evaluation metric is recall@k since the goal is to capture positives in the top-k results.<br />
<br />
<br />
===Results===<br />
<br />
Table 3 shows the results on the '''SQuAD''' dataset, as observed the performance using the combination of all 3 pre-training tasks has the best performance, with the exception of R@1 for 1%/99% train/test split. BoW-MLP is used to justify the use of a transformer as the encoder since it has more complexity, BoW-MLP here looks up uni-grams from an embedding table, aggregates the embeddings with average pooling, and passes them through a shallow two-layer ML network with <math> tanh </math> activation to generate the final 512-dimensional query/document embeddings.<br />
<br />
[[File:table3-30.PNG | center]]<br />
<br />
Table 4 shows the results on the '''Natural Questions''' dataset, as observed the performance using the combination of all 3 pre-training tasks has the best performance in all testing scenarios.<br />
<br />
[[File:table4-30.PNG | center]]<br />
<br />
====Ablation Study====<br />
The embedding-dimensions, # of layers in BERT, and pre-training tasks are altered for this ablation study as seen in Table 5.<br />
<br />
All three tasks, ICT, BFS and WLP, perform significantly better than MLM when used individually. When compared with each other, ICT performs better than the other two tasks, followed by BFS and then WLP. Interestingly ICT+BFS+WLP is only an absolute of 1.5% better than just ICT in the low-data regime. The authors infer this could be representative of when there is insufficient downstream data for training that more global pre-training tasks become increasingly beneficial since BFS and WLP offer this.<br />
<br />
[[File:table5-30.PNG | center]]<br />
<br />
====Evaluation of Open-Domain Retrieval====<br />
Augment ReQA benchmark with large-scale (sentence, evidence passage) pairs extracted from general Wikipedia. This is added as one million external candidate pairs into the existing retrieval candidate set of the ReQA benchmark. <br />
<br />
[[File:table6-30.PNG | center]]<br />
<br />
The 3 talking points about the above results are as follows <br><br />
• The two-tower Transformer models pretrained with ICT+BFS+WLP and ICT substantially outperform the BM-25 baseline. <br><br />
• ICT+BFS+WLP pre-training method consistently improves the ICT pre-training method <br><br />
• Results here are fairly consistent with previous results.<br />
<br />
==Conclusion==<br />
<br />
The authors performed a comprehensive study and have concluded that properly designing paragraph-level pre-training tasks including ICT, BFS, and WLP for a two-tower transformer model can significantly outperform the BM-25 algorithm, which is an unsupervised method using weighted token-matching as a scoring function. The findings of this paper suggest that a two-tower transformer model with proper pre-training tasks can replace the BM25 algorithm used in the retrieval stage to further improve end-to-end system performance. The authors plan to test the pre-training tasks with a larger variety of encoder architectures and trying other corpora than Wikipedia and additionally trying pre-training in contrast to different regularization methods.<br />
<br />
==Critique==<br />
<br />
The authors of this paper suggest from their experiments that the combination of the 3 pretraining tasks ICT, BFS, and WLP yields the best results in general. However, they do acknowledge why two of the three pre-training tasks may not produce equally significant results as the three. From the ablation study and the small margin (1.5%) in which the three tasks achieved outperformed only using ICT in combination with observing ICT outperforming the three as seen in table 6 in certain scenarios. It seems plausible that a subset of two tasks could also potentially outperform the three or reach a similar performance. I believe this should be addressed in order to conclude three pre-training tasks are needed over two, though they never address this comparison I think it would make the paper more complete.<br />
<br />
The two-tower transformer model has two encoders, one for query encoding, and one for passage encoding. But in theory, queries and passages are all textual information. Do we really need two different encoders? Maybe using one encoder to encode both queries and passages can achieve competitive results. I'd like to see if any paper has ever covered this.<br />
<br />
As this paper falls under the family of approaches that involves the dual encoder architecture where the query and document are encoded independently of each other and efficient retrieval is achieved using approximate nearest neighbor search as they have used and they have just shown that such dilemma can be resolved with matching-oriented pretraining that imitate the matching between the question and paragraph in open-domain QA. However, the existing pretraining strategies could be highly sample-inefficient and typically require a very large batch size (up to thousands) such that diverse and effective negative question-paragraph pairs could be included in each batch. As mentioned in Equation \eqref{eq:sampled_SoftMax}, they utilized a sampled SoftMax technique to approximate the full-SoftMax, which has the main advantage which saves a lot of memory during training instead of load the whole dataset to the memory. Another thing that might be interesting to implement is to use dense representation for the 3 pretraining tasks ICT, BFS, and WLP as mention in this paper [9], which also used a different way of in-batch sampling during the training of the encoders and last and not least, instead of using the ICT in training, I would like to compare it with synthetic queries by considering two decoding techniques, which are beam search and nucleus sampling illustrated in this paper [10].<br />
It is important to note that pre-training tasks such as BFS and WLP require a specific structure of the document. BFS randomly selects a sentence in the first section of a Wikipedia page to be used as a query and a random paragraph for that same page as the document. WLP behaves the same way as BFS for the query, but samples the paragraph for another Wikipedia page linked to the current one using a hyperlink. These constraints may slow down the adoption of these methods on general text corpora.<br />
<br />
== References ==<br />
[1] Wei-Cheng Chang, Felix X Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. Pre-training tasks for embedding-based large-scale retrieval. arXiv preprint arXiv:2002.03932, 2020.<br />
<br />
[2] Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.<br />
<br />
[3] Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, 2019.<br />
<br />
[4] Ashish Vaswani and Noam Shazeer, et al. Attention Is All You Need. arXiv:1706.03762, 2017.<br />
<br />
[5] Ahmad, Amin and Constant, Noah and Yang, Yinfei and Cer, Daniel. ReQA: An Evaluation for End-to-End Answer Retrieval Models. arXiv:1907.04780, 2019.<br />
<br />
[6] Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv:1606.05250, 2016.<br />
<br />
[7] Google. NQ: Natural Questions. https://ai.google.com/research/NaturalQuestions<br />
<br />
[8] Kenton Lee and Ming-Wei Chang and Kristina Toutanova. Latent Retrieval for Weakly Supervised Open Domain Question Answering. arXiv:1906.00300, 2019.<br />
<br />
[9] Karpukhin, V., O˘guz, B., Min, S., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense passage retrieval for open-domain question answering. CoRR (04 2020)<br />
<br />
[10] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. International Conference on Learning Representations (ICLR). (2020)</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval&diff=49835Pre-Training Tasks For Embedding-Based Large-Scale Retrieval2020-12-08T04:53:05Z<p>N2chatti: /* Introduction */</p>
<hr />
<div><br />
==Contributors==<br />
Pierre McWhannel wrote this summary with editorial and technical contributions from STAT 946 fall 2020 classmates. The summary is based on the paper "Pre-Training Tasks for Embedding-Based Large-Scale Retrieval" which was presented at ICLR 2020. The authors of this paper are Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar [1].<br />
<br />
==Introduction==<br />
The problem domain in which the paper is positioned is large-scale query-document retrieval. This problem is: given a query/question to collect relevant documents which contain the answer(s) and identify the span of words which contain the answer(s). This problem is often separated into two steps: 1) a retrieval phase which reduces a large corpus to a much smaller subset. Variants of the classical probabilistic retrieval model BM25 are usually used during this phase to rapidly filter a set of documents. BM25 is considered as one of strongest baseline algorithms for IR and is known for being highly effective 2) a reader which is applied to read the shortened list of documents and score them based on their ability to answer the query and identify the span containing the answer, these spans can be ranked according to some scoring function. In the setting of an open-domain question answering, the query, <math>q</math>, is a question and the documents <math>d</math> represent a passage which may contain the answer(s). The focus of this paper is on the retriever question answering (QA), in this setting a scoring function <math>f: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R} </math> is utilized to map <math>(q,d) \in \mathcal{X} \times \mathcal{Y}</math> to a real value which represents the score of how relevant the passage is to the query. We desire high scores for relevant passages and low scores otherwise and this is the job of the retriever. <br />
<br />
<br />
Certain characteristics desired of the retriever are high recall since the reader will only be applied to a subset meaning there is an opportunity to identify false positives but not false negatives. The other characteristic desired of a retriever is low latency meaning it is computationally efficient to be used for real-world applications. The popular approach to meet these heuristics has been the BM-25 [2] which relies on token-based matching between two high-dimensional and sparse vectors representing <math>q</math> and <math>d</math>. This approach performs well and retrieval can be performed in time sublinear to the number of passages with inverter indexing. The downfall of this type of algorithm is its inability to be optimized for a specific task. An alternate option is BERT [3] or transformer-based models [4] with cross attention between query and passage pairs which can be optimized for a specific task. For example, BERT assigned both masked language model and Next Sentence Prediction, pre-trained on various open resources data(BookCorpus + English WIKIPEDIA, CC NEWS, OpenWebtex, and Stories) with total of 160 GB. However, these models suffer from latency as the retriever needs to process all the pairwise combinations of a query with all passages. The next type of algorithm is an embedding-based model that jointly embeds the query and passage in the same embedding space and then uses measures such as cosine distance, inner product, or even the Euclidean distance in this space to get a score. The authors suggest the two-tower models can capture deeper semantic relationships in addition to being able to be optimized for specific tasks. This model is referred to as the "two-tower retrieval model", since it has two transformer-based models where one embeds the query <math>\phi{(\cdot)}</math> and the other the passage <math>\psi{(\cdot)}</math>, then the embeddings can be scored by <math>f(q,d) = \langle \phi{(q)},\psi{(d)} \rangle \in \mathbb{R} </math>. This model can avoid the latency problems of a cross-layer model by pre-computing <math>\psi{(d)}</math> beforehand and then by utilizing efficient approximate nearest neighbor search algorithms in the embedding space to find the nearest documents.<br />
<br />
<br />
These two-tower retrieval models often use pre-trained weights from BERT. In BERT pre-trained model, the weight is obtained by the pre-training tasks are masked-LM (MLM) and next sentence prediction (NSP). The authors note that the research of pre-training tasks that improve the performance of two-tower retrieval models is an unsolved research problem. This is exactly what the authors have done in this paper. That is, to develop pre-training tasks for the two-tower model based on heuristics and then validated by experimental results. The pre-training tasks suggested are Inverse Cloze Task (ICT), Body First Search (BFS), Wiki Link Prediction (WLP), and combining all three. The authors used the Retrieval Question-Answering ('''ReQA''') benchmark [5] and used two datasets '''SQuAD'''[6] and '''Natural Questions'''[7] for training and evaluating their models.<br />
<br />
<br />
<br />
===Contributions of this paper===<br />
*The two-tower transformer model with proper pre-training can significantly outperform the widely used BM-25 Algorithm.<br />
*Paragraph-level pre-training tasks such as ICT, BFS, and WLP hugely improve the retrieval quality, whereas the most widely used pre-training task (the token-level masked-LM) gives only marginal gains.<br />
*The two-tower models with deep transformer encoders benefit more from paragraph-level pre-training compared to its shallow bag-of-word counterpart (BoW-MLP)<br />
<br />
==Background on Two-Tower Retrieval Models==<br />
This section is to present the two-tower model in more detail. As seen previously <math>q \in \mathcal{X}</math>, <math> d \in \mathcal{y}</math> are the query and document or passage respectively. The two-tower retrieval model consists of two encoders: <math>\phi: \mathcal{X} \rightarrow \mathbb{R}^k</math> and <math>\psi: \mathcal{Y} \rightarrow \mathbb{R}^k</math>. The tokens in <math>\mathcal{X},\mathcal{Y}</math> are a sequence of tokens representing the query and document respectively. The scoring function is <math> f(q,d) = \langle \phi{(q)},\psi{(d)} \rangle \in \mathbb{R} </math>. The cross-attention BERT-style model would have <math> f_{\theta,w}(q,d) = \psi_{\theta}{(q \oplus d)^{T}}w </math> as a scoring function and <math> \oplus </math> signifies the concatenation of the sequences of <math>q</math> and <math>d</math>, <math> \theta </math> are the parameters of the cross-attention model. The architectures of these two models can be seen below in figure 1.<br />
<br />
<br/><br />
[[File:two-tower-vs-cross-layer.PNG|center|Figure 1: end-to-end model]]<br />
<br/><br />
<br />
===Comparisons===<br />
<br />
====Inference====<br />
In terms of inference, two-tower architecture has a distinct advantage as being easily seen from the architecture. Since the query and document are embedded independently, <math> \psi{(d)}</math> can be pre-computed. This is an important advantage as it makes the two-tower method feasible. As an example, the authors state the inner product between 100 query embeddings and 1 million document embeddings only takes hundreds of milliseconds on CPUs, however, for the equivalent calculation for a cross-attention-model it takes hours on GPUs. In addition, they remark the retrieval of the relevant documents can be done in sublinear time with respect <math> |\mathcal{Y}| </math> using a maximum inner product (MIPS) algorithm with little loss in the recall.<br />
<br />
====Learning====<br />
In terms of learning the two-tower model compared to BM-25 has a unique advantage of being possible to be fine-tuned for downstream tasks. This problem is within the paradigm of metric learning as the scoring function can be passed as input to an objective function of the log likelihood <math> \max_{\theta} \space log(p_{\theta}(d|q)) </math>. The conditional probability is represented by a SoftMax and then can be rewritten as:<br />
<br />
<br />
<math> p_{\theta}(d|q) = \frac{exp(f_{\theta}(q,d))}{\sum_{d' \in D} exp(f_{\theta}(q,d'))} \tag{1} \label{eq:sampled_SoftMax}</math><br />
<br />
<br />
Here <math> \mathcal{D} </math> is the set of all possible documents, the denominator is a summation over <math>\mathcal{D} </math>. As is customary, they utilize a sampled SoftMax technique to approximate the full-SoftMax. The technique they have employed is to use a small subset of documents in the current training batch, while also using a proper correcting term to ensure the unbiasedness of the partition function.<br />
<br />
====Why do we need pre-training then?====<br />
Although the two-tower model can be fine-tuned for a downstream task getting sufficiently labeled data is not always possible. This is an opportunity to look for better pre-training tasks as the authors have done. The authors suggest getting a set of pre-training task <math> \mathcal{T} </math> with labeled positive pairs of queries of documents/passages and then afterward fine-tuning on the limited amount of data for the downstream task.<br />
<br />
==Pre-Training Tasks==<br />
The authors suggest three pre-training tasks for the encoders of the two-tower retriever, these are ICT, BFS, and WLP. Keep in mind that ICT [8] is not a newly proposed task. These pre-training tasks are developed in accordance with two heuristics they state and our paragraph-level pre-training tasks in contrast to more common sentence-level or word-level tasks:<br />
<br />
# The tasks should capture different resolutions of semantics between query and document. The semantics can be the local context within a paragraph, global consistency within a document, and even semantic relation between two documents.<br />
# Collecting the data for the pre-training tasks should require as little human intervention as possible and be cost-efficient.<br />
<br />
All three of the following tasks can be constructed in a cost-efficient manner and with minimal human intervention by utilizing Wikipedia articles. In figure 2 below the dashed line surrounding the block of text is to be regarded as the document <math> d </math> and the rest of the circled portions of text make up the three queries <math>q_1, q_2, q_3 </math> selected for each task.<br />
<br />
<br/><br />
[[File:pre-train-tasks.PNG|center|Figure 1: end-to-end model]]<br />
<br/><br />
<br />
===Inverse Cloze Task (ICT)===<br />
This task is given a passage <math> p </math> consisting of <math> n </math> sentences <math>s_{i}, i=1,..,n </math>. <math> s_{i} </math> is randomly sampled and used as the query <math> q </math> and then the document is constructed to be based on the remaining <math> n-1 </math> sentences. This is utilized to meet the local context within a paragraph of the first heuristic. In figure 2 <math> q_1 </math> is an example of a potential query for ICT based on the selected document <math> d </math>.<br />
<br />
===Body First Search (BFS)===<br />
BFS is <math> (q,d) </math> pairs are created by randomly selecting a sentence <math> q_2 </math> and setting it as the query from the first section of a Wikipedia page which contains the selected document <math> d </math>. This is utilized to meet the global context consistency within a document as the first section in Wikipedia generally is an overview of the whole page and they anticipate it to contain information central to the topic. In figure 2 <math> q_2 </math> is an example of a potential query for BFS based on the selected document <math> d </math>.<br />
<br />
===Wiki Link Prediction (WLP)===<br />
The third task is selecting a hyperlink that is within the selected document <math> d </math>, as can be observed in figure 2 "machine learning" is a valid option. The query <math> q_3 </math> will then be a random sentence on the first section of the page where the hyperlinks redirects to. This is utilized to provide the semantic relation between documents.<br />
<br />
<br />
Additionally, Masked LM (MLM) is considered during the experimentations which again is one of the primary pre-training tasks used in BERT.<br />
<br />
==Experiments and Results==<br />
<br />
===Training Specifications===<br />
<br />
*Each tower is constructed from the 12 layers BERT-base model.<br />
*Embedding is achieved by applying a linear layer on the [CLS] token output to get an embedding dimension of 512.<br />
*Sequence lengths of 64 and 288 respectively for the query and document encoders.<br />
*Pre-trained on 32 TPU v3 chips for 100k step with Adam optimizer learning rate of <math> 1 \times 10^{-4} </math> with 0.1 warm-up ratio, followed by linear learning rate decay and batch size of 8192. (~2.5 days to complete)<br />
*Fine tuning on downstream task <math> 5 \times 10^{-5} </math> with 2000 training steps and batch size 512.<br />
*Tokenizer is WordPiece.<br />
<br />
===Pre-Training Tasks Setup===<br />
<br />
*MLM, ICT, BFS, and WLP are used.<br />
*Various combinations of the aforementioned tasks are tested.<br />
*Tasks define <math> (q,d) </math> pairings.<br />
*ICT's document <math> d </math> has the article title and passage separated by a [SEP] symbol as input to the document encoder.<br />
<br />
Below table 1 shows some statistics for the datasets constructed for the ICT, BFS, and WLP tasks. Token counts considered after the tokenizer is applied.<br />
<br />
[[File: wiki-pre-train-statistics.PNG | center]]<br />
<br />
===Downstream Tasks Setup===<br />
Retrieval Question-Answering ('''ReQA''') benchmark used.<br />
<br />
*Datasets: '''SQuAD''' and '''Natural Questions'''.<br />
*Each entry of data from datasets is <math> (q,a,p) </math>, where each element is, query <math> q </math>, answer <math> a </math>, and passage <math> p </math> containing the answer, respectively.<br />
*Authors split the passage to sentences <math>s_i</math> where <math> i=1,...n </math> and <math> n </math> is the number of sentences in the passage.<br />
*The problem is then recast for a given query to retrieve the correct sentence-passage <math> (s,p) </math> from all candidates for all passages split in this fashion.<br />
*They remark their reformulation makes it a more difficult problem.<br />
<br />
Note: Experiments done on '''ReQA''' benchmarks are not entirely open-domain QA retrieval as the candidate <math> (s,p) </math> only cover the training set of QA dataset instead of entire Wikipedia articles. There are some experiments they did with augmented the dataset as will be seen in table 6.<br />
<br />
Below table 2 shows statistics of the datasets used for the downstream QA task.<br />
<br />
<br />
[[File:data-statistics-30.PNG | center]]<br />
<br />
===Evaluation===<br />
<br />
*3 train/test splits: 1%/99%, 5%/95%, and 80%/20%.<br />
*10% of the training set is used as a validation set for hyper-parameter tuning.<br />
*Training never has seen the test queries in the test pairs of <math>(q,d).</math><br />
*Evaluation metric is recall@k since the goal is to capture positives in the top-k results.<br />
<br />
<br />
===Results===<br />
<br />
Table 3 shows the results on the '''SQuAD''' dataset, as observed the performance using the combination of all 3 pre-training tasks has the best performance, with the exception of R@1 for 1%/99% train/test split. BoW-MLP is used to justify the use of a transformer as the encoder since it has more complexity, BoW-MLP here looks up uni-grams from an embedding table, aggregates the embeddings with average pooling, and passes them through a shallow two-layer ML network with <math> tanh </math> activation to generate the final 512-dimensional query/document embeddings.<br />
<br />
[[File:table3-30.PNG | center]]<br />
<br />
Table 4 shows the results on the '''Natural Questions''' dataset, as observed the performance using the combination of all 3 pre-training tasks has the best performance in all testing scenarios.<br />
<br />
[[File:table4-30.PNG | center]]<br />
<br />
====Ablation Study====<br />
The embedding-dimensions, # of layers in BERT, and pre-training tasks are altered for this ablation study as seen in Table 5.<br />
<br />
All three tasks, ICT, BFS and WLP, perform significantly better than MLM when used individually. When compared with each other, ICT performs better than the other two tasks, followed by BFS and then WLP. Interestingly ICT+BFS+WLP is only an absolute of 1.5% better than just ICT in the low-data regime. The authors infer this could be representative of when there is insufficient downstream data for training that more global pre-training tasks become increasingly beneficial since BFS and WLP offer this.<br />
<br />
[[File:table5-30.PNG | center]]<br />
<br />
====Evaluation of Open-Domain Retrieval====<br />
Augment ReQA benchmark with large-scale (sentence, evidence passage) pairs extracted from general Wikipedia. This is added as one million external candidate pairs into the existing retrieval candidate set of the ReQA benchmark. <br />
<br />
[[File:table6-30.PNG | center]]<br />
<br />
The 3 talking points about the above results are as follows <br><br />
• The two-tower Transformer models pretrained with ICT+BFS+WLP and ICT substantially outperform the BM-25 baseline. <br><br />
• ICT+BFS+WLP pre-training method consistently improves the ICT pre-training method <br><br />
• Results here are fairly consistent with previous results.<br />
<br />
==Conclusion==<br />
<br />
The authors performed a comprehensive study and have concluded that properly designing paragraph-level pre-training tasks including ICT, BFS, and WLP for a two-tower transformer model can significantly outperform the BM-25 algorithm, which is an unsupervised method using weighted token-matching as a scoring function. The findings of this paper suggest that a two-tower transformer model with proper pre-training tasks can replace the BM25 algorithm used in the retrieval stage to further improve end-to-end system performance. The authors plan to test the pre-training tasks with a larger variety of encoder architectures and trying other corpora than Wikipedia and additionally trying pre-training in contrast to different regularization methods.<br />
<br />
==Critique==<br />
<br />
The authors of this paper suggest from their experiments that the combination of the 3 pretraining tasks ICT, BFS, and WLP yields the best results in general. However, they do acknowledge why two of the three pre-training tasks may not produce equally significant results as the three. From the ablation study and the small margin (1.5%) in which the three tasks achieved outperformed only using ICT in combination with observing ICT outperforming the three as seen in table 6 in certain scenarios. It seems plausible that a subset of two tasks could also potentially outperform the three or reach a similar performance. I believe this should be addressed in order to conclude three pre-training tasks are needed over two, though they never address this comparison I think it would make the paper more complete.<br />
<br />
The two-tower transformer model has two encoders, one for query encoding, and one for passage encoding. But in theory, queries and passages are all textual information. Do we really need two different encoders? Maybe using one encoder to encode both queries and passages can achieve competitive results. I'd like to see if any paper has ever covered this.<br />
<br />
As this paper full under the family of approaches that involves the dual encoder architecture where the query and document are encoded independently of each other and efficient retrieval is achieved using approximate nearest neighbor search as they have used and they have just shown that such dilemma can be resolved with matching-oriented pretraining that imitate the matching between the question and paragraph in open-domain QA. However, the existing pretraining strategies could be highly sample-inefficient and typically require a very large batch size (up to thousands) such that diverse and effective negative question-paragraph pairs could be included in each batch. As mention in Equation \eqref{eq:sampled_SoftMax}, they utilized a sampled SoftMax technique to approximate the full-SoftMax, which has the main advantage which saves a lot of memory during training instead of load the whole dataset to the memory. Another thing that might be interesting to implement is to use dense representation for the 3 pretraining tasks ICT, BFS, and WLP as mention in this paper [9], which also used a different way of in-batch sampling during the training of the encoders and last and not least, instead of using the ICT in training, I would like to compare it with synthetic queries by considering two decoding techniques, which are beam search and nucleus sampling illustrated in this paper [10].<br />
<br />
== References ==<br />
[1] Wei-Cheng Chang, Felix X Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. Pre-training tasks for embedding-based large-scale retrieval. arXiv preprint arXiv:2002.03932, 2020.<br />
<br />
[2] Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.<br />
<br />
[3] Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, 2019.<br />
<br />
[4] Ashish Vaswani and Noam Shazeer, et al. Attention Is All You Need. arXiv:1706.03762, 2017.<br />
<br />
[5] Ahmad, Amin and Constant, Noah and Yang, Yinfei and Cer, Daniel. ReQA: An Evaluation for End-to-End Answer Retrieval Models. arXiv:1907.04780, 2019.<br />
<br />
[6] Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv:1606.05250, 2016.<br />
<br />
[7] Google. NQ: Natural Questions. https://ai.google.com/research/NaturalQuestions<br />
<br />
[8] Kenton Lee and Ming-Wei Chang and Kristina Toutanova. Latent Retrieval for Weakly Supervised Open Domain Question Answering. arXiv:1906.00300, 2019.<br />
<br />
[9] Karpukhin, V., O˘guz, B., Min, S., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense passage retrieval for open-domain question answering. CoRR (04 2020)<br />
<br />
[10] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. International Conference on Learning Representations (ICLR). (2020)</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval&diff=49834Pre-Training Tasks For Embedding-Based Large-Scale Retrieval2020-12-08T04:52:32Z<p>N2chatti: /* Introduction */</p>
<hr />
<div><br />
==Contributors==<br />
Pierre McWhannel wrote this summary with editorial and technical contributions from STAT 946 fall 2020 classmates. The summary is based on the paper "Pre-Training Tasks for Embedding-Based Large-Scale Retrieval" which was presented at ICLR 2020. The authors of this paper are Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar [1].<br />
<br />
==Introduction==<br />
The problem domain in which the paper is positioned is large-scale query-document retrieval. This problem is: given a query/question to collect relevant documents which contain the answer(s) and identify the span of words which contain the answer(s). This problem is often separated into two steps 1) a retrieval phase which reduces a large corpus to a much smaller subset. Variants of the classical probabilistic retrieval model BM25 are usually used during this phase to rapidly filter a set of documents. BM25 is considered as one of strongest baseline algorithms for IR and is known for being highly effective) 2) a reader which is applied to read the shortened list of documents and score them based on their ability to answer the query and identify the span containing the answer, these spans can be ranked according to some scoring function. In the setting of an open-domain question answering, the query, <math>q</math>, is a question and the documents <math>d</math> represent a passage which may contain the answer(s). The focus of this paper is on the retriever question answering (QA), in this setting a scoring function <math>f: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R} </math> is utilized to map <math>(q,d) \in \mathcal{X} \times \mathcal{Y}</math> to a real value which represents the score of how relevant the passage is to the query. We desire high scores for relevant passages and low scores otherwise and this is the job of the retriever. <br />
<br />
<br />
Certain characteristics desired of the retriever are high recall since the reader will only be applied to a subset meaning there is an opportunity to identify false positives but not false negatives. The other characteristic desired of a retriever is low latency meaning it is computationally efficient to be used for real-world applications. The popular approach to meet these heuristics has been the BM-25 [2] which relies on token-based matching between two high-dimensional and sparse vectors representing <math>q</math> and <math>d</math>. This approach performs well and retrieval can be performed in time sublinear to the number of passages with inverter indexing. The downfall of this type of algorithm is its inability to be optimized for a specific task. An alternate option is BERT [3] or transformer-based models [4] with cross attention between query and passage pairs which can be optimized for a specific task. For example, BERT assigned both masked language model and Next Sentence Prediction, pre-trained on various open resources data(BookCorpus + English WIKIPEDIA, CC NEWS, OpenWebtex, and Stories) with total of 160 GB. However, these models suffer from latency as the retriever needs to process all the pairwise combinations of a query with all passages. The next type of algorithm is an embedding-based model that jointly embeds the query and passage in the same embedding space and then uses measures such as cosine distance, inner product, or even the Euclidean distance in this space to get a score. The authors suggest the two-tower models can capture deeper semantic relationships in addition to being able to be optimized for specific tasks. This model is referred to as the "two-tower retrieval model", since it has two transformer-based models where one embeds the query <math>\phi{(\cdot)}</math> and the other the passage <math>\psi{(\cdot)}</math>, then the embeddings can be scored by <math>f(q,d) = \langle \phi{(q)},\psi{(d)} \rangle \in \mathbb{R} </math>. This model can avoid the latency problems of a cross-layer model by pre-computing <math>\psi{(d)}</math> beforehand and then by utilizing efficient approximate nearest neighbor search algorithms in the embedding space to find the nearest documents.<br />
<br />
<br />
These two-tower retrieval models often use pre-trained weights from BERT. In BERT pre-trained model, the weight is obtained by the pre-training tasks are masked-LM (MLM) and next sentence prediction (NSP). The authors note that the research of pre-training tasks that improve the performance of two-tower retrieval models is an unsolved research problem. This is exactly what the authors have done in this paper. That is, to develop pre-training tasks for the two-tower model based on heuristics and then validated by experimental results. The pre-training tasks suggested are Inverse Cloze Task (ICT), Body First Search (BFS), Wiki Link Prediction (WLP), and combining all three. The authors used the Retrieval Question-Answering ('''ReQA''') benchmark [5] and used two datasets '''SQuAD'''[6] and '''Natural Questions'''[7] for training and evaluating their models.<br />
<br />
<br />
<br />
===Contributions of this paper===<br />
*The two-tower transformer model with proper pre-training can significantly outperform the widely used BM-25 Algorithm.<br />
*Paragraph-level pre-training tasks such as ICT, BFS, and WLP hugely improve the retrieval quality, whereas the most widely used pre-training task (the token-level masked-LM) gives only marginal gains.<br />
*The two-tower models with deep transformer encoders benefit more from paragraph-level pre-training compared to its shallow bag-of-word counterpart (BoW-MLP)<br />
<br />
==Background on Two-Tower Retrieval Models==<br />
This section is to present the two-tower model in more detail. As seen previously <math>q \in \mathcal{X}</math>, <math> d \in \mathcal{y}</math> are the query and document or passage respectively. The two-tower retrieval model consists of two encoders: <math>\phi: \mathcal{X} \rightarrow \mathbb{R}^k</math> and <math>\psi: \mathcal{Y} \rightarrow \mathbb{R}^k</math>. The tokens in <math>\mathcal{X},\mathcal{Y}</math> are a sequence of tokens representing the query and document respectively. The scoring function is <math> f(q,d) = \langle \phi{(q)},\psi{(d)} \rangle \in \mathbb{R} </math>. The cross-attention BERT-style model would have <math> f_{\theta,w}(q,d) = \psi_{\theta}{(q \oplus d)^{T}}w </math> as a scoring function and <math> \oplus </math> signifies the concatenation of the sequences of <math>q</math> and <math>d</math>, <math> \theta </math> are the parameters of the cross-attention model. The architectures of these two models can be seen below in figure 1.<br />
<br />
<br/><br />
[[File:two-tower-vs-cross-layer.PNG|center|Figure 1: end-to-end model]]<br />
<br/><br />
<br />
===Comparisons===<br />
<br />
====Inference====<br />
In terms of inference, two-tower architecture has a distinct advantage as being easily seen from the architecture. Since the query and document are embedded independently, <math> \psi{(d)}</math> can be pre-computed. This is an important advantage as it makes the two-tower method feasible. As an example, the authors state the inner product between 100 query embeddings and 1 million document embeddings only takes hundreds of milliseconds on CPUs, however, for the equivalent calculation for a cross-attention-model it takes hours on GPUs. In addition, they remark the retrieval of the relevant documents can be done in sublinear time with respect <math> |\mathcal{Y}| </math> using a maximum inner product (MIPS) algorithm with little loss in the recall.<br />
<br />
====Learning====<br />
In terms of learning the two-tower model compared to BM-25 has a unique advantage of being possible to be fine-tuned for downstream tasks. This problem is within the paradigm of metric learning as the scoring function can be passed as input to an objective function of the log likelihood <math> \max_{\theta} \space log(p_{\theta}(d|q)) </math>. The conditional probability is represented by a SoftMax and then can be rewritten as:<br />
<br />
<br />
<math> p_{\theta}(d|q) = \frac{exp(f_{\theta}(q,d))}{\sum_{d' \in D} exp(f_{\theta}(q,d'))} \tag{1} \label{eq:sampled_SoftMax}</math><br />
<br />
<br />
Here <math> \mathcal{D} </math> is the set of all possible documents, the denominator is a summation over <math>\mathcal{D} </math>. As is customary, they utilize a sampled SoftMax technique to approximate the full-SoftMax. The technique they have employed is to use a small subset of documents in the current training batch, while also using a proper correcting term to ensure the unbiasedness of the partition function.<br />
<br />
====Why do we need pre-training then?====<br />
Although the two-tower model can be fine-tuned for a downstream task getting sufficiently labeled data is not always possible. This is an opportunity to look for better pre-training tasks as the authors have done. The authors suggest getting a set of pre-training task <math> \mathcal{T} </math> with labeled positive pairs of queries of documents/passages and then afterward fine-tuning on the limited amount of data for the downstream task.<br />
<br />
==Pre-Training Tasks==<br />
The authors suggest three pre-training tasks for the encoders of the two-tower retriever, these are ICT, BFS, and WLP. Keep in mind that ICT [8] is not a newly proposed task. These pre-training tasks are developed in accordance with two heuristics they state and our paragraph-level pre-training tasks in contrast to more common sentence-level or word-level tasks:<br />
<br />
# The tasks should capture different resolutions of semantics between query and document. The semantics can be the local context within a paragraph, global consistency within a document, and even semantic relation between two documents.<br />
# Collecting the data for the pre-training tasks should require as little human intervention as possible and be cost-efficient.<br />
<br />
All three of the following tasks can be constructed in a cost-efficient manner and with minimal human intervention by utilizing Wikipedia articles. In figure 2 below the dashed line surrounding the block of text is to be regarded as the document <math> d </math> and the rest of the circled portions of text make up the three queries <math>q_1, q_2, q_3 </math> selected for each task.<br />
<br />
<br/><br />
[[File:pre-train-tasks.PNG|center|Figure 1: end-to-end model]]<br />
<br/><br />
<br />
===Inverse Cloze Task (ICT)===<br />
This task is given a passage <math> p </math> consisting of <math> n </math> sentences <math>s_{i}, i=1,..,n </math>. <math> s_{i} </math> is randomly sampled and used as the query <math> q </math> and then the document is constructed to be based on the remaining <math> n-1 </math> sentences. This is utilized to meet the local context within a paragraph of the first heuristic. In figure 2 <math> q_1 </math> is an example of a potential query for ICT based on the selected document <math> d </math>.<br />
<br />
===Body First Search (BFS)===<br />
BFS is <math> (q,d) </math> pairs are created by randomly selecting a sentence <math> q_2 </math> and setting it as the query from the first section of a Wikipedia page which contains the selected document <math> d </math>. This is utilized to meet the global context consistency within a document as the first section in Wikipedia generally is an overview of the whole page and they anticipate it to contain information central to the topic. In figure 2 <math> q_2 </math> is an example of a potential query for BFS based on the selected document <math> d </math>.<br />
<br />
===Wiki Link Prediction (WLP)===<br />
The third task is selecting a hyperlink that is within the selected document <math> d </math>, as can be observed in figure 2 "machine learning" is a valid option. The query <math> q_3 </math> will then be a random sentence on the first section of the page where the hyperlinks redirects to. This is utilized to provide the semantic relation between documents.<br />
<br />
<br />
Additionally, Masked LM (MLM) is considered during the experimentations which again is one of the primary pre-training tasks used in BERT.<br />
<br />
==Experiments and Results==<br />
<br />
===Training Specifications===<br />
<br />
*Each tower is constructed from the 12 layers BERT-base model.<br />
*Embedding is achieved by applying a linear layer on the [CLS] token output to get an embedding dimension of 512.<br />
*Sequence lengths of 64 and 288 respectively for the query and document encoders.<br />
*Pre-trained on 32 TPU v3 chips for 100k step with Adam optimizer learning rate of <math> 1 \times 10^{-4} </math> with 0.1 warm-up ratio, followed by linear learning rate decay and batch size of 8192. (~2.5 days to complete)<br />
*Fine tuning on downstream task <math> 5 \times 10^{-5} </math> with 2000 training steps and batch size 512.<br />
*Tokenizer is WordPiece.<br />
<br />
===Pre-Training Tasks Setup===<br />
<br />
*MLM, ICT, BFS, and WLP are used.<br />
*Various combinations of the aforementioned tasks are tested.<br />
*Tasks define <math> (q,d) </math> pairings.<br />
*ICT's document <math> d </math> has the article title and passage separated by a [SEP] symbol as input to the document encoder.<br />
<br />
Below table 1 shows some statistics for the datasets constructed for the ICT, BFS, and WLP tasks. Token counts considered after the tokenizer is applied.<br />
<br />
[[File: wiki-pre-train-statistics.PNG | center]]<br />
<br />
===Downstream Tasks Setup===<br />
Retrieval Question-Answering ('''ReQA''') benchmark used.<br />
<br />
*Datasets: '''SQuAD''' and '''Natural Questions'''.<br />
*Each entry of data from datasets is <math> (q,a,p) </math>, where each element is, query <math> q </math>, answer <math> a </math>, and passage <math> p </math> containing the answer, respectively.<br />
*Authors split the passage to sentences <math>s_i</math> where <math> i=1,...n </math> and <math> n </math> is the number of sentences in the passage.<br />
*The problem is then recast for a given query to retrieve the correct sentence-passage <math> (s,p) </math> from all candidates for all passages split in this fashion.<br />
*They remark their reformulation makes it a more difficult problem.<br />
<br />
Note: Experiments done on '''ReQA''' benchmarks are not entirely open-domain QA retrieval as the candidate <math> (s,p) </math> only cover the training set of QA dataset instead of entire Wikipedia articles. There are some experiments they did with augmented the dataset as will be seen in table 6.<br />
<br />
Below table 2 shows statistics of the datasets used for the downstream QA task.<br />
<br />
<br />
[[File:data-statistics-30.PNG | center]]<br />
<br />
===Evaluation===<br />
<br />
*3 train/test splits: 1%/99%, 5%/95%, and 80%/20%.<br />
*10% of the training set is used as a validation set for hyper-parameter tuning.<br />
*Training never has seen the test queries in the test pairs of <math>(q,d).</math><br />
*Evaluation metric is recall@k since the goal is to capture positives in the top-k results.<br />
<br />
<br />
===Results===<br />
<br />
Table 3 shows the results on the '''SQuAD''' dataset, as observed the performance using the combination of all 3 pre-training tasks has the best performance, with the exception of R@1 for 1%/99% train/test split. BoW-MLP is used to justify the use of a transformer as the encoder since it has more complexity, BoW-MLP here looks up uni-grams from an embedding table, aggregates the embeddings with average pooling, and passes them through a shallow two-layer ML network with <math> tanh </math> activation to generate the final 512-dimensional query/document embeddings.<br />
<br />
[[File:table3-30.PNG | center]]<br />
<br />
Table 4 shows the results on the '''Natural Questions''' dataset, as observed the performance using the combination of all 3 pre-training tasks has the best performance in all testing scenarios.<br />
<br />
[[File:table4-30.PNG | center]]<br />
<br />
====Ablation Study====<br />
The embedding-dimensions, # of layers in BERT, and pre-training tasks are altered for this ablation study as seen in Table 5.<br />
<br />
All three tasks, ICT, BFS and WLP, perform significantly better than MLM when used individually. When compared with each other, ICT performs better than the other two tasks, followed by BFS and then WLP. Interestingly ICT+BFS+WLP is only an absolute of 1.5% better than just ICT in the low-data regime. The authors infer this could be representative of when there is insufficient downstream data for training that more global pre-training tasks become increasingly beneficial since BFS and WLP offer this.<br />
<br />
[[File:table5-30.PNG | center]]<br />
<br />
====Evaluation of Open-Domain Retrieval====<br />
Augment ReQA benchmark with large-scale (sentence, evidence passage) pairs extracted from general Wikipedia. This is added as one million external candidate pairs into the existing retrieval candidate set of the ReQA benchmark. <br />
<br />
[[File:table6-30.PNG | center]]<br />
<br />
The 3 talking points about the above results are as follows <br><br />
• The two-tower Transformer models pretrained with ICT+BFS+WLP and ICT substantially outperform the BM-25 baseline. <br><br />
• ICT+BFS+WLP pre-training method consistently improves the ICT pre-training method <br><br />
• Results here are fairly consistent with previous results.<br />
<br />
==Conclusion==<br />
<br />
The authors performed a comprehensive study and have concluded that properly designing paragraph-level pre-training tasks including ICT, BFS, and WLP for a two-tower transformer model can significantly outperform the BM-25 algorithm, which is an unsupervised method using weighted token-matching as a scoring function. The findings of this paper suggest that a two-tower transformer model with proper pre-training tasks can replace the BM25 algorithm used in the retrieval stage to further improve end-to-end system performance. The authors plan to test the pre-training tasks with a larger variety of encoder architectures and trying other corpora than Wikipedia and additionally trying pre-training in contrast to different regularization methods.<br />
<br />
==Critique==<br />
<br />
The authors of this paper suggest from their experiments that the combination of the 3 pretraining tasks ICT, BFS, and WLP yields the best results in general. However, they do acknowledge why two of the three pre-training tasks may not produce equally significant results as the three. From the ablation study and the small margin (1.5%) in which the three tasks achieved outperformed only using ICT in combination with observing ICT outperforming the three as seen in table 6 in certain scenarios. It seems plausible that a subset of two tasks could also potentially outperform the three or reach a similar performance. I believe this should be addressed in order to conclude three pre-training tasks are needed over two, though they never address this comparison I think it would make the paper more complete.<br />
<br />
The two-tower transformer model has two encoders, one for query encoding, and one for passage encoding. But in theory, queries and passages are all textual information. Do we really need two different encoders? Maybe using one encoder to encode both queries and passages can achieve competitive results. I'd like to see if any paper has ever covered this.<br />
<br />
As this paper full under the family of approaches that involves the dual encoder architecture where the query and document are encoded independently of each other and efficient retrieval is achieved using approximate nearest neighbor search as they have used and they have just shown that such dilemma can be resolved with matching-oriented pretraining that imitate the matching between the question and paragraph in open-domain QA. However, the existing pretraining strategies could be highly sample-inefficient and typically require a very large batch size (up to thousands) such that diverse and effective negative question-paragraph pairs could be included in each batch. As mention in Equation \eqref{eq:sampled_SoftMax}, they utilized a sampled SoftMax technique to approximate the full-SoftMax, which has the main advantage which saves a lot of memory during training instead of load the whole dataset to the memory. Another thing that might be interesting to implement is to use dense representation for the 3 pretraining tasks ICT, BFS, and WLP as mention in this paper [9], which also used a different way of in-batch sampling during the training of the encoders and last and not least, instead of using the ICT in training, I would like to compare it with synthetic queries by considering two decoding techniques, which are beam search and nucleus sampling illustrated in this paper [10].<br />
<br />
== References ==<br />
[1] Wei-Cheng Chang, Felix X Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. Pre-training tasks for embedding-based large-scale retrieval. arXiv preprint arXiv:2002.03932, 2020.<br />
<br />
[2] Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.<br />
<br />
[3] Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, 2019.<br />
<br />
[4] Ashish Vaswani and Noam Shazeer, et al. Attention Is All You Need. arXiv:1706.03762, 2017.<br />
<br />
[5] Ahmad, Amin and Constant, Noah and Yang, Yinfei and Cer, Daniel. ReQA: An Evaluation for End-to-End Answer Retrieval Models. arXiv:1907.04780, 2019.<br />
<br />
[6] Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv:1606.05250, 2016.<br />
<br />
[7] Google. NQ: Natural Questions. https://ai.google.com/research/NaturalQuestions<br />
<br />
[8] Kenton Lee and Ming-Wei Chang and Kristina Toutanova. Latent Retrieval for Weakly Supervised Open Domain Question Answering. arXiv:1906.00300, 2019.<br />
<br />
[9] Karpukhin, V., O˘guz, B., Min, S., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense passage retrieval for open-domain question answering. CoRR (04 2020)<br />
<br />
[10] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. International Conference on Learning Representations (ICLR). (2020)</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49384This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T13:16:55Z<p>N2chatti: /* Datasets */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. [1]<br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN.[2] Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.[3]<br />
<br />
== Network Architecture ==<br />
The figure below represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
<br />
== Training Algorithm ==<br />
The training of the Network is divide into 3 stages: Starting with stochastic gradient descent (SGD) of the layers (other than the last one) then projection of prototype and finally the convex optimization. In the initial stage the model identifies the most significant patches for the classification task and distinguishes between the prototypes of the images' true classes and those that are from different classes. SGD is used to optimize the parameters from the convolution layers and the prototypes of the prototype layer while fixing the weights of the fully connected layer in order to make the network learn to decrease the predicted probability when a part of an image of a given class is similar to a prototype from a different class. As for the second stage the aim is to visualize and associate each prototype with the most similar training image patch using the following update for every prototype of a class k:<br />
<math> P_j = \underset{z\ in Z_j}{\operatorname{arg\,min}} \lVert{z -p_j}\rVert_2 \quad\textrm{where}\quad Z_j = \{z:z \in \quad\textrm{patches} (f(x_i)) \forall i \quad\textrm{s.t}\quad y_i=k \} </math><br />
During this stage, associating a patch of the training image x to its corresponding prototype p is done as a result of the activation. The patch of x that is selected is the one that p activates the most given the activation map of x by p.<br />
In the last training stage, the convex optimization is applied on the last layer while fixing parameters of previous layers, to improve accuracy by adding sparsity to the model. In other word it makes the model ignore the reasoning process of decision making of this kind: an image belongs to a given class because it is not have prototypes from another class.<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
'''Examples of reasoning process:''' <br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg|1200px|center]]<br />
<div align="center">Figure 2 : Classifying an image of specific car model </div><br />
[[File:exp2.jpg|1200px|center]]<br />
<div align="center">Figure 3 : Predicting the specie of a bird </div><br />
<br />
== Results ==<br />
The results obtained using ProtoPNet on bird images as well as the car models are compared to the baseline models as well as attention-based deep models that were trained on the same datasets that ProtoPNet was trained on. ProtoPNet accuracy results are very close and as good as the non-interpretable baselines as shown in the tables below. <br />
[[File:table1protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 4 : Accuracy comparison of ProtoPNet with baseline models and other deep models on bird species dataset </div><br />
<br />
[[File:table2protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 5 : Accuracy comparison of ProtoPNet with baseline models on car dataset </div><br />
<br />
Another experience of combining many protoPNet models shows an improvement of the accuracy while preserving the transparency of the decision making process.<br />
<br />
== Conclusion ==<br />
The aim of constructing the ProtopNet network was introduce the interpretability property to neural networks. As a matter of fact, the use of ProtopNet makes the process of classifying images clearer. It is able to dissect images to find prototypical parts. And the predictions of classes of an image are made based on a comparison of parts of this image and learned prototypes of each classes. One of the greatest advantage of this ProtopNet is that it allows the user to observe the process of how the model is making predictions and therefore understands the reasoning in case of misclassification errors for example. However one constraint of this Network can be the pre-determined number of prototypes as it is domain related and has to be set beforehand.<br />
<br />
== Source code ==<br />
The code for this paper is available at [https://github.com/cfchen-duke/ProtoPNet https://github.com/cfchen-duke/ProtoPNet]<br />
<br />
== Refrences == <br />
[1] C. Chen, O. Li, A. Barnett, J. Su, C. Rudin, This looks like that: deep<br />
learning for interpretable image recognition, arXiv preprint,<br />
arXiv:1806.10574, 2018.<br />
<br />
[1] A. Holt, I. Bichindaritz, R. Schmidt, and P. Perner. Medical applications in case-based reasoning. The<br />
Knowledge Engineering Review, 20:289–292, 09 2005.<br />
<br />
[2] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Inside Convolutional Networks: Visualising Image<br />
Classification Models and Saliency Maps. In Workshop at the 2nd International Conference on Learning<br />
Representations (ICLR Workshop), 2014<br />
<br />
[3] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative<br />
Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),<br />
pages 2921–2929. IEEE, 2016</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49383This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T13:16:33Z<p>N2chatti: </p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. [1]<br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN.[2] Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.[3]<br />
<br />
== Network Architecture ==<br />
The figure below represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
<br />
== Training Algorithm ==<br />
The training of the Network is divide into 3 stages: Starting with stochastic gradient descent (SGD) of the layers (other than the last one) then projection of prototype and finally the convex optimization. In the initial stage the model identifies the most significant patches for the classification task and distinguishes between the prototypes of the images' true classes and those that are from different classes. SGD is used to optimize the parameters from the convolution layers and the prototypes of the prototype layer while fixing the weights of the fully connected layer in order to make the network learn to decrease the predicted probability when a part of an image of a given class is similar to a prototype from a different class. As for the second stage the aim is to visualize and associate each prototype with the most similar training image patch using the following update for every prototype of a class k:<br />
<math> P_j = \underset{z\ in Z_j}{\operatorname{arg\,min}} \lVert{z -p_j}\rVert_2 \quad\textrm{where}\quad Z_j = \{z:z \in \quad\textrm{patches} (f(x_i)) \forall i \quad\textrm{s.t}\quad y_i=k \} </math><br />
During this stage, associating a patch of the training image x to its corresponding prototype p is done as a result of the activation. The patch of x that is selected is the one that p activates the most given the activation map of x by p.<br />
In the last training stage, the convex optimization is applied on the last layer while fixing parameters of previous layers, to improve accuracy by adding sparsity to the model. In other word it makes the model ignore the reasoning process of decision making of this kind: an image belongs to a given class because it is not have prototypes from another class.<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
'''Examples of reasoning process''' <br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg|1200px|center]]<br />
<div align="center">Figure 2 : Classifying an image of specific car model </div><br />
[[File:exp2.jpg|1200px|center]]<br />
<div align="center">Figure 3 : Predicting the specie of a bird </div><br />
<br />
== Results ==<br />
The results obtained using ProtoPNet on bird images as well as the car models are compared to the baseline models as well as attention-based deep models that were trained on the same datasets that ProtoPNet was trained on. ProtoPNet accuracy results are very close and as good as the non-interpretable baselines as shown in the tables below. <br />
[[File:table1protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 4 : Accuracy comparison of ProtoPNet with baseline models and other deep models on bird species dataset </div><br />
<br />
[[File:table2protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 5 : Accuracy comparison of ProtoPNet with baseline models on car dataset </div><br />
<br />
Another experience of combining many protoPNet models shows an improvement of the accuracy while preserving the transparency of the decision making process.<br />
<br />
== Conclusion ==<br />
The aim of constructing the ProtopNet network was introduce the interpretability property to neural networks. As a matter of fact, the use of ProtopNet makes the process of classifying images clearer. It is able to dissect images to find prototypical parts. And the predictions of classes of an image are made based on a comparison of parts of this image and learned prototypes of each classes. One of the greatest advantage of this ProtopNet is that it allows the user to observe the process of how the model is making predictions and therefore understands the reasoning in case of misclassification errors for example. However one constraint of this Network can be the pre-determined number of prototypes as it is domain related and has to be set beforehand.<br />
<br />
== Source code ==<br />
The code for this paper is available at [https://github.com/cfchen-duke/ProtoPNet https://github.com/cfchen-duke/ProtoPNet]<br />
<br />
== Refrences == <br />
[1] C. Chen, O. Li, A. Barnett, J. Su, C. Rudin, This looks like that: deep<br />
learning for interpretable image recognition, arXiv preprint,<br />
arXiv:1806.10574, 2018.<br />
<br />
[1] A. Holt, I. Bichindaritz, R. Schmidt, and P. Perner. Medical applications in case-based reasoning. The<br />
Knowledge Engineering Review, 20:289–292, 09 2005.<br />
<br />
[2] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Inside Convolutional Networks: Visualising Image<br />
Classification Models and Saliency Maps. In Workshop at the 2nd International Conference on Learning<br />
Representations (ICLR Workshop), 2014<br />
<br />
[3] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative<br />
Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),<br />
pages 2921–2929. IEEE, 2016</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49325This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T02:37:10Z<p>N2chatti: /* Conclusion */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. [1]<br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN.[2] Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.[3]<br />
<br />
== Network Architecture ==<br />
The figure below represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
<br />
== Training Algorithm ==<br />
The training of the Network is divide into 3 stages: Starting with stochastic gradient descent (SGD) of the layers (other than the last one) then projection of prototype and finally the convex optimization. In the initial stage the model identifies the most significant patches for the classification task and distinguishes between the prototypes of the images' true classes and those that are from different classes. SGD is used to optimize the parameters from the convolution layers and the prototypes of the prototype layer while fixing the weights of the fully connected layer in order to make the network learn to decrease the predicted probability when a part of an image of a given class is similar to a prototype from a different class. As for the second stage the aim is to visualize and associate each prototype with the most similar training image patch using the following update for every prototype of a class k:<br />
<math> P_j = \underset{z\ in Z_j}{\operatorname{arg\,min}} \lVert{z -p_j}\rVert_2 \quad\textrm{where}\quad Z_j = \{z:z \in \quad\textrm{patches} (f(x_i)) \forall i \quad\textrm{s.t}\quad y_i=k \} </math><br />
During this stage, associating a patch of the training image x to its corresponding prototype p is done as a result of the activation. The patch of x that is selected is the one that p activates the most given the activation map of x by p.<br />
In the last training stage, the convex optimization is applied on the last layer while fixing parameters of previous layers, to improve accuracy by adding sparsity to the model. In other word it makes the model ignore the reasoning process of decision making of this kind: an image belongs to a given class because it is not have prototypes from another class.<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
=== Examples of reasoning process ===<br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg|1200px|center]]<br />
<div align="center">Figure 2 : Classifying an image of specific car model </div><br />
[[File:exp2.jpg|1200px|center]]<br />
<div align="center">Figure 3 : Predicting the specie of a bird </div><br />
<br />
== Results: ==<br />
The results obtained using ProtoPNet on bird images as well as the car models are compared to the baseline models as well as attention-based deep models that were trained on the same datasets that ProtoPNet was trained on. ProtoPNet accuracy results are very close and as good as the non-interpretable baselines as shown in the tables below. <br />
[[File:table1protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 4 : Accuracy comparison of ProtoPNet with baseline models and other deep models on bird species dataset </div><br />
<br />
[[File:table2protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 5 : Accuracy comparison of ProtoPNet with baseline models on car dataset </div><br />
<br />
Another experience of combining many protoPNet models shows an improvement of the accuracy while preserving the transparency of the decision making process.<br />
<br />
== Conclusion ==<br />
The aim of constructing the ProtopNet network was introduce the interpretability property to neural networks. As a matter of fact, the use of ProtopNet makes the process of classifying images clearer. It is able to dissect images to find prototypical parts. And the predictions of classes of an image are made based on a comparison of parts of this image and learned prototypes of each classes. One of the greatest advantage of this ProtopNet is that it allows the user to observe the process of how the model is making predictions and therefore understands the reasoning in case of misclassification errors for example. However one constraint of this Network can be the pre-determined number of prototypes as it is domain related and has to be set beforehand.<br />
<br />
== Source code ==<br />
The code for this paper is available at [https://github.com/cfchen-duke/ProtoPNet https://github.com/cfchen-duke/ProtoPNet]<br />
<br />
== Refrences == <br />
[1] C. Chen, O. Li, A. Barnett, J. Su, C. Rudin, This looks like that: deep<br />
learning for interpretable image recognition, arXiv preprint,<br />
arXiv:1806.10574, 2018.<br />
<br />
[1] A. Holt, I. Bichindaritz, R. Schmidt, and P. Perner. Medical applications in case-based reasoning. The<br />
Knowledge Engineering Review, 20:289–292, 09 2005.<br />
<br />
[2] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Inside Convolutional Networks: Visualising Image<br />
Classification Models and Saliency Maps. In Workshop at the 2nd International Conference on Learning<br />
Representations (ICLR Workshop), 2014<br />
<br />
[3] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative<br />
Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),<br />
pages 2921–2929. IEEE, 2016</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49324This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T02:27:25Z<p>N2chatti: /* Refrences */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. [1]<br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN.[2] Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.[3]<br />
<br />
== Network Architecture ==<br />
The figure below represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
<br />
== Training Algorithm ==<br />
The training of the Network is divide into 3 stages: Starting with stochastic gradient descent (SGD) of the layers (other than the last one) then projection of prototype and finally the convex optimization. In the initial stage the model identifies the most significant patches for the classification task and distinguishes between the prototypes of the images' true classes and those that are from different classes. SGD is used to optimize the parameters from the convolution layers and the prototypes of the prototype layer while fixing the weights of the fully connected layer in order to make the network learn to decrease the predicted probability when a part of an image of a given class is similar to a prototype from a different class. As for the second stage the aim is to visualize and associate each prototype with the most similar training image patch using the following update for every prototype of a class k:<br />
<math> P_j = \underset{z\ in Z_j}{\operatorname{arg\,min}} \lVert{z -p_j}\rVert_2 \quad\textrm{where}\quad Z_j = \{z:z \in \quad\textrm{patches} (f(x_i)) \forall i \quad\textrm{s.t}\quad y_i=k \} </math><br />
During this stage, associating a patch of the training image x to its corresponding prototype p is done as a result of the activation. The patch of x that is selected is the one that p activates the most given the activation map of x by p.<br />
In the last training stage, the convex optimization is applied on the last layer while fixing parameters of previous layers, to improve accuracy by adding sparsity to the model. In other word it makes the model ignore the reasoning process of decision making of this kind: an image belongs to a given class because it is not have prototypes from another class.<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
=== Examples of reasoning process ===<br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg|1200px|center]]<br />
<div align="center">Figure 2 : Classifying an image of specific car model </div><br />
[[File:exp2.jpg|1200px|center]]<br />
<div align="center">Figure 3 : Predicting the specie of a bird </div><br />
<br />
== Results: ==<br />
The results obtained using ProtoPNet on bird images as well as the car models are compared to the baseline models as well as attention-based deep models that were trained on the same datasets that ProtoPNet was trained on. ProtoPNet accuracy results are very close and as good as the non-interpretable baselines as shown in the tables below. <br />
[[File:table1protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 4 : Accuracy comparison of ProtoPNet with baseline models and other deep models on bird species dataset </div><br />
<br />
[[File:table2protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 5 : Accuracy comparison of ProtoPNet with baseline models on car dataset </div><br />
<br />
Another experience of combining many protoPNet models shows an improvement of the accuracy while preserving the transparency of the decision making process.<br />
<br />
== Conclusion ==<br />
The aim of constructing the ProtopNet network was introduce the interpretability property to neural networks. As a matter of fact, the use of ProtopNet makes the process of classifying images clearer. It is able to dissect images to find prototypical parts. And the predictions of classes of an image are made based on a comparison of parts of this image and learned prototypes of each classes. One constraint of this Network can be the pre-determined number of prototypes as it is domain related and has to be set beforehand.<br />
<br />
== Source code ==<br />
The code for this paper is available at [https://github.com/cfchen-duke/ProtoPNet https://github.com/cfchen-duke/ProtoPNet]<br />
<br />
== Refrences == <br />
[1] C. Chen, O. Li, A. Barnett, J. Su, C. Rudin, This looks like that: deep<br />
learning for interpretable image recognition, arXiv preprint,<br />
arXiv:1806.10574, 2018.<br />
<br />
[1] A. Holt, I. Bichindaritz, R. Schmidt, and P. Perner. Medical applications in case-based reasoning. The<br />
Knowledge Engineering Review, 20:289–292, 09 2005.<br />
<br />
[2] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Inside Convolutional Networks: Visualising Image<br />
Classification Models and Saliency Maps. In Workshop at the 2nd International Conference on Learning<br />
Representations (ICLR Workshop), 2014<br />
<br />
[3] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative<br />
Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),<br />
pages 2921–2929. IEEE, 2016</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49323This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T02:27:10Z<p>N2chatti: </p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. [1]<br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN.[2] Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.[3]<br />
<br />
== Network Architecture ==<br />
The figure below represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
<br />
== Training Algorithm ==<br />
The training of the Network is divide into 3 stages: Starting with stochastic gradient descent (SGD) of the layers (other than the last one) then projection of prototype and finally the convex optimization. In the initial stage the model identifies the most significant patches for the classification task and distinguishes between the prototypes of the images' true classes and those that are from different classes. SGD is used to optimize the parameters from the convolution layers and the prototypes of the prototype layer while fixing the weights of the fully connected layer in order to make the network learn to decrease the predicted probability when a part of an image of a given class is similar to a prototype from a different class. As for the second stage the aim is to visualize and associate each prototype with the most similar training image patch using the following update for every prototype of a class k:<br />
<math> P_j = \underset{z\ in Z_j}{\operatorname{arg\,min}} \lVert{z -p_j}\rVert_2 \quad\textrm{where}\quad Z_j = \{z:z \in \quad\textrm{patches} (f(x_i)) \forall i \quad\textrm{s.t}\quad y_i=k \} </math><br />
During this stage, associating a patch of the training image x to its corresponding prototype p is done as a result of the activation. The patch of x that is selected is the one that p activates the most given the activation map of x by p.<br />
In the last training stage, the convex optimization is applied on the last layer while fixing parameters of previous layers, to improve accuracy by adding sparsity to the model. In other word it makes the model ignore the reasoning process of decision making of this kind: an image belongs to a given class because it is not have prototypes from another class.<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
=== Examples of reasoning process ===<br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg|1200px|center]]<br />
<div align="center">Figure 2 : Classifying an image of specific car model </div><br />
[[File:exp2.jpg|1200px|center]]<br />
<div align="center">Figure 3 : Predicting the specie of a bird </div><br />
<br />
== Results: ==<br />
The results obtained using ProtoPNet on bird images as well as the car models are compared to the baseline models as well as attention-based deep models that were trained on the same datasets that ProtoPNet was trained on. ProtoPNet accuracy results are very close and as good as the non-interpretable baselines as shown in the tables below. <br />
[[File:table1protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 4 : Accuracy comparison of ProtoPNet with baseline models and other deep models on bird species dataset </div><br />
<br />
[[File:table2protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 5 : Accuracy comparison of ProtoPNet with baseline models on car dataset </div><br />
<br />
Another experience of combining many protoPNet models shows an improvement of the accuracy while preserving the transparency of the decision making process.<br />
<br />
== Conclusion ==<br />
The aim of constructing the ProtopNet network was introduce the interpretability property to neural networks. As a matter of fact, the use of ProtopNet makes the process of classifying images clearer. It is able to dissect images to find prototypical parts. And the predictions of classes of an image are made based on a comparison of parts of this image and learned prototypes of each classes. One constraint of this Network can be the pre-determined number of prototypes as it is domain related and has to be set beforehand.<br />
<br />
== Source code ==<br />
The code for this paper is available at [https://github.com/cfchen-duke/ProtoPNet https://github.com/cfchen-duke/ProtoPNet]<br />
<br />
== Refrences == <br />
[1] C. Chen, O. Li, A. Barnett, J. Su, C. Rudin, This looks like that: deep<br />
learning for interpretable image recognition, arXiv preprint,<br />
arXiv:1806.10574, 2018.<br />
<br />
<br />
[1] A. Holt, I. Bichindaritz, R. Schmidt, and P. Perner. Medical applications in case-based reasoning. The<br />
Knowledge Engineering Review, 20:289–292, 09 2005.<br />
<br />
[2] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Inside Convolutional Networks: Visualising Image<br />
Classification Models and Saliency Maps. In Workshop at the 2nd International Conference on Learning<br />
Representations (ICLR Workshop), 2014<br />
<br />
[3] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative<br />
Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),<br />
pages 2921–2929. IEEE, 2016</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_Representations&diff=49322Self-Supervised Learning of Pretext-Invariant Representations2020-12-06T02:11:54Z<p>N2chatti: /* Image Classification with linear models */</p>
<hr />
<div>==Authors==<br />
<br />
Ishan Misra, Laurens van der Maaten<br />
<br />
== Presented by == <br />
Sina Farsangi<br />
<br />
== Introduction == <br />
<br />
Modern image recognition and object detection systems find image representations using a large number of data with pre-defined semantic annotations. Some examples of these annotations are class labels [1] and bonding boxes [2], as shown in Figure 1. There is a need for a large number of labeled data that is not the case in all scenarios for finding representations using pre-defined semantic annotations. Also, these systems usually learn specific features for a particular type of class and not necessarily semantically meaningful features that can help generalize to other domains and classes. '''In other words, pre-defined semantic annotations scale poorly to the long tail of visual concepts'''[3]. Therefore, there has been a big interest in the community to find image representations that are more visually meaningful and can help in several tasks such as image recognition and object detection. One of the fast-growing areas of research that tries to address this problem is '''Self-Supervised Learning'''. Self-Supervised Learning tries to learn deep models that find image representations from the pixels themselves rather than using pre-defined semantic annotated data. As we will show, in self-supervised learning, there is no need for using human-provided class labels or bounding boxes for classification and object detection tasks, respectively. <br />
<br />
[[File: SSL_1.JPG | 800px | center]]<br />
<div align="center">'''Figure 1:''' Semantic Annotations used for finding image representations: a) Class labels and b) Bounding Boxes </div><br />
<br />
Self-Supervised Learning is often done using a set of tasks called '''Pretext tasks'''. During these tasks, a transformation <math> \tau </math> is applied to unlabeled images <math> I </math> to obtain a set of transformed images, <math> I^{t} </math>. Then, a deep neural network, <math> \phi(\theta) </math>, is trained to predict the transformation characteristic. Several Pretext Tasks exist based on the type of used transformation. Two of the most used pretext tasks are rotations and jigsaw puzzle [4,5,6]. As shown in Figure 2, in the rotation task, unlabeled images, <math> </math> are rotated by random degrees (0,90,180,270) and the deep network learns to predict the rotation degree. In the jigsaw task, which is more complicated than the rotation prediction task, unlabeled images are cropped into 9 patches, then, the image is perturbed by randomly permuting the nine patches. Each permutation falls into one of the 35 classes according to a formula. A deep network is then trained to predict the class of the permutation of the patches in the perturbed image. Some other tasks include colorization, where the model tries to revert the colors of a colored image turned to greyscale, and image reconstruction, where a square chunk of the image is deleted and the model tries to reconstruct that part. <br />
<br />
[[File: SSL_2.JPG |1000px | center]]<br />
<div align="center">'''Figure 2:''' Self-Supervised Learning using Rotation and Jigsaw Pretext Tasks </div><br />
<br />
Although the proposed pretext tasks have achieved promising results, they have the disadvantage of being covariant to the applied transformation. In other words, as deep networks are trained to predict transformation characteristics, they will also learn representations that will vary based on the applied transformation. By intuition, we would like to obtain representations the are common between the original images and the transformed ones. This idea is supported by the fact that humans are able to recognize these transformed images. This hints us to try to develop a method that obtains image representations that are common between the original and transformed images, in other words, image representations that are transformation invariant. The paper tries to address this problem by introducing '''Pretext Invariant Representation Learning''' (PIRL) that learns to obtain Self-Supervised image representations that as opposed to Pretext Tasks are transformation invariant and therefore, more semantically meaningful. The performance of the proposed method is evaluated on several Self-Supervision learning benchmarks. The results show that the PIRL introduces a new state-of-the-art method in Self-Supervised Learning by learning transformation invariant representations.<br />
<br />
== Problem Formulation and Methodology ==<br />
<br />
[[File: SSL_3.JPG | 800px | center]]<br />
<div align="center">'''Figure 3:''' Overview of Standard Pretext Learning and Pretext-Invariant Representation Learning (PIRL). </div><br />
<br />
<br />
An overview of the proposed method and a comparison with Pretext Tasks are shown in Figure 3. For a given image ,<math>I</math>, in the Dataset of unlabeled images, <math> D=\{{I_1,I_2,...,I_{|D|}}\} </math>, a transformation <math> \tau </math> is applied: <br />
<br />
\begin{align} \tag{1} \label{eqn:1}<br />
I^t=\tau(I)<br />
\end{align}<br />
<br />
Where <math>I^t</math> is the transformed image. We would like to train a convolutional neural network, <math>\phi(\theta)</math>, that constructs image representations <math>v_{I}=\phi_{\theta}(I)</math>. Pretext Task based methods learn to predict transformation characteristics, <math>z(t)</math>, by minimizing a transformation covariant loss function in the form of:<br />
<br />
\begin{align} \tag{2} \label{eqn:2}<br />
l_{\text{cov}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,z(t)<br />
\end{align}<br />
<br />
As it can be seen, the loss function covaries with the applied transformation and therefore, the obtained representations may not be semantically meaningful. PIRL tries to solve for this problem as shown in Figure 3. The original and transformed images are passed through two parallel convolutional neural networks to obtain two set of representations, <math>v(I)</math> and <math>v(I^t)</math>. Then, a contrastive loss function is defined to ensure that the representations of the original and transformed images are similar to each other. The transformation invariant loss function can be defined as:<br />
<br />
\begin{align} \tag{3} \label{eqn:3}<br />
l_{\text{inv}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,v_{I^t})<br />
\end{align}<br />
<br />
Where L is a contrastive loss based on Noise Contrastive Estimators (NCE). The NCE function can be shown as below: <br />
<br />
\begin{align} \tag{4} \label{eqn:4}<br />
h(v_I,v_{I^t})=\frac{\exp \biggl( \frac{s(v_I,v_{I^t}}{\tau}) \biggr)}{\exp \biggl(\frac{s(v_I,v_{I^t}}{\tau} \biggr) + \sum_{I^{'} \in D_N}^{} \exp \biggl( \frac{s(v_{I^t},v_{I^{'}}}{\tau}) \biggr)}<br />
\end{align}<br />
<br />
where <math>s(.,.)</math> is the cosine similarity function and <math>\tau</math> is the temperature parameter that is usually set to 0.07. Also, a set of N images are chosen randomly from dataset where <math>I^{'}\neq I</math>. These images are used in the loss in order to ensure their representation dissimilarity with transformed image representations. Also, during model implementation, two heads (few additional deep layers) , <math>f</math> and <math>g</math>, are applied on top of <math>v(I)</math> and <math>v(I^t)</math>. Using the NCE formulation, the contrastive loss can be written as:<br />
<br />
\begin{align} \tag{5} \label{eqn:5}<br />
L_{\text{NCE}}(I,I^{t})=-\text{log}[h(f(v_I),g(v_{I^t}))]-\sum_{I^{'}\in D_N}^{} \text{log}[1-h(g(v_{I^t}),f(v_{I^{'}}))]<br />
\end{align}<br />
<br />
[[File: SSL_4.JPG | 800px | center]]<br />
<div align="center">'''Figure 4:''' Proposed PIRL </div><br />
<br />
Although the formulation looks complicated, the take out here is that by minimizing the NCE based loss function, the similarity between the original and transformed image representations, <math>v(I)</math> and <math>v(I^t)</math> , increases and at the same time the dissimilarity between <math>v(I^t)</math> and negative images representations, <math>v(I^{'})</math>, are increased. During training a memory bank [], <math>m_I</math>, of dataset image representations are used to access the representations of the dataset images including the negative images. The proposed PIRL model is shown in Figure (4). Finally, the contrastive loss in equation (5) does not take into account the dissimilarity between the original image representations, <math>v(I)</math>, and the negative image representations, <math>v(I^{'})</math>. By taking this into account and using the memory bank, the final contrastive loss function is obtained as:<br />
<br />
\begin{align} \tag{6} \label{eqn:6}<br />
L(I,I^{t})=\lambda L_{\text{NCE}}(m_I,g(v_{I^t})) + (1-\lambda)L_{\text{NCE}}(m_I,f(v_{I}))<br />
\end{align}<br />
Where <math>\lambda</math> is a hyperparameter that determines the weight of each of NCE losses. The default value for this parameter is 0.5. In the next section, experimental results are shown using the proposed PIRL model.<br />
<br />
==Experimental Results ==<br />
<br />
For the experiments in this section, PIRL is implemented using jigsaw transformations. The combination of PIRL with other types of transformations is shown in the last section of the summary. The quality of image representations obtained from PIRL Self-Supervised Learning is evaluated by comparing its performance to other Self-Supervised Learning methods on image recognition and object detection tasks. For the experiments, a ResNet50 model is trained using PIRL and other methods by using 1.28M randomly sampled images from the ImageNet dataset. Also, the number of negative images used for PIRL is N=32000. <br />
<br />
===Object Detection===<br />
<br />
A Faster R-CNN model with a ResNet-50 backbone, pre-trained using PIRL and other Self-Supervised methods, is employed for the object detection task. Then, the pre-trained model weights are used as initial weights for the Faster-RCNN model backbone during training on the VOC07+12 dataset. The result of object detection using PIRL is shown in Figure (5) and it is compared to other methods. It can be seen that PIRL not only outperforms other Self-Supervised-based methods, '''for the first time it outperforms Supervised Pre-training on object detection'''. <br />
<br />
[[File: SSL_5.PNG | 800px | center]]<br />
<div align="center">'''Figure 5:''' Object detection on VOC07+12 using Faster R-CNN and comparing the Average Precision (AP) of detected bounding boxes. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
===Image Classification with linear models===<br />
<br />
In the next experiment, the performance of the PIRL is evaluated on image classification using four different datasets. For this experiment, the pre-trained ResNet-50 model is utilized as an image feature extractor. Then, a linear classifier is trained on fixed image representations. The results are shown in Figure (6). The results demonstrate that while PIRL substantially outperforms other Self-Supervised Learning methods, it still falls behind Supervised Pre-trained Learning. <br />
<br />
[[File: SSL_6.PNG | 800px | center]]<br />
<div align="center">'''Figure 6:''' Image classification with linear models. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
Overall, the results show that PIRL has the best performance among different Self-Supervised Learning methods. Even, it is able to perform better than the Supervised Learning Pretrained model on object detection. This is because PIRL learns representations that are invariant to the applied transformations which results in more semantically meaningful and richer visual features. In the next section, some analysis on PIRL is presented.<br />
<br />
==Analysis==<br />
<br />
===Does PIRL learn invariant representations?===<br />
<br />
In order to show that the image representations obtained using PIRL are invariant, several images are chosen from the ImageNet dataset, and representations of the chosen images and their transformed version are obtained using one-time PIRL and another time the jigsaw pretext task which is the transformation covariant version of PIRL. Then, for each method, the L2 norm between the original and transformed image representations are computed and their distributions are plotted in Figure (7). It can be seen that PIRL results in more similarity between the original and transformed image representations. Therefore, PIRL learns invariant representations. <br />
<br />
[[File: SSL_7.PNG | 800px | center]]<br />
<div align="center">'''Figure 7:''' Invariance of PIRL representations. </div><br />
<br />
===Which layer produces the best representation?===<br />
Figure 12 studies the quality of representations in earlier layers of the convolutional networks. The figure reveals that the quality of Jigsaw representations improves from the conv1 to the res4 layer but that their quality sharply decreases in the res5 layer. By contrast, PIRL representations are invariant to image transformations and the best image representations are extracted from the res5 layer of PIRL-trained networks.<br />
<br />
[[File: Paper29_SSL.PNG | 800px | center]]<br />
<div align="center">'''Figure 12:'''Quality of PIRL representations per layer. </div><br />
<br />
===What is the effect of <math>\lambda</math> in the PIRL loss function?===<br />
<br />
In order to investigate the effect of <math>\lambda</math> on PIRL representations, the authors obtained the accuracy of image recognition on ImageNet dataset using different values for <math>\lambda</math> in PIRL. As shown in Figure 8, the results show that the value of <math>\lambda</math> affects the performance of PIRL and the optimum value for <math>\lambda</math> is 0.5. <br />
<br />
[[File: SSL_8.PNG | 800px | center]]<br />
<div align="center">'''Figure 8:''' Effect of varying the parameter <math>\lambda</math> </div><br />
<br />
===What is the effect of the number of image transforms?===<br />
<br />
As another experiment, the authors investigated the number of image transforms and their effect on PIRL performance. There is a limitation on the number of transformations that can be applied using the jigsaw pretext method as this method has to predict the permutation of the patches and the number of the parameters in the classification layer grows linearly with the number of used transformations. However, PIRL is able to use all number of image transformations which is equal to <math>9! \approx 3.6\times 10^5</math>. Figure (9) shows the effect of changing the number of patch permutations on PIRL and jigsaw. The results show that increasing the number of permutations increases the mean Average Precision (mAP) of PIRL on image classification using the VOCC07 dataset. <br />
<br />
[[File: SSL_9.PNG | 800px | center]]<br />
<div align="center">'''Figure 9:''' Effect of varying the number of patch permutations </div><br />
<br />
===What is the effect of the number of negative samples?===<br />
<br />
In order to investigate the effect of negative samples number, N, on PIRL's performance, the image classification accuracy is obtained using ImageNet dataset for a variety of values for N. As it is shown in Figure (10), increasing the number of negative sample results in richer image representations and higher classification accuracy. <br />
<br />
[[File: SSL_10.PNG | 800px | center]]<br />
<div align="center">'''Figure 10:''' Effect of varying the number of negative samples </div><br />
<br />
==Generalizing PIRL to Other Pretext Tasks==<br />
<br />
The used PIRL model in this paper used jigsaw permutations as the applied transformation to the original image. However, PIRL is generalizable to other Pretext Tasks. To show this, first, PIRL is used with rotation transformations and the performance of rotation-based PIRL is compared to the covariant rotation Pretext Task. The results in Figure (11) show that using PIRL substantially increases the classification accuracy on four datasets in comparison with the rotation Pretext Task. Next, both jigsaw and rotation transformations are used with PIRL to obtain image representations. The results show that combining multiple transformations with PIRL can further improve the accuracy of the image classification task. <br />
<br />
[[File: SSL_11.PNG | 800px | center]]<br />
<div align="center">'''Figure 11:''' Using PIRL with (combinations of) different pretext tasks </div><br />
<br />
==Conclusion==<br />
<br />
In this paper, a new state-of-the-art Self-Supervised learning method, PIRL, was presented. The proposed model learns to obtain features that are common between the original and transformed images, resulting in a set of transformation invariant and more semantically meaningful features. This is done by defining a contrastive loss function between the original images, transformed images, and a set of negative images. The results show that PIRL image representation is richer than previously proposed methods, resulting in higher accuracy and precision on image classification and object detection tasks.<br />
<br />
==Critiques==<br />
<br />
The paper proposes a very nice method for obtaining transformation invariant image representations. However, the authors can extend their work with a richer set of transformations. Also, it would be a good idea to investigate the combination of PIRL with clustering-based methods [7,8]. One of the clustering-based methods is '''DeepCluster''' [7], where each previous version of its representation is used by bootstrapping to produce a target for the next representation. They built a new representation through clustering data points using the prior representation and then classify the target by using the clustered index of each sample. That may result in better image representations. This will avoid the use of negative pairs but it might also cause collapsing to trivial solutions which create a trade-off. <br />
<br />
It could be better if they could visualize their network weights and compare them to the other supervised methods for the deeper layers that extract high-level information.<br />
<br />
== Source Code ==<br />
<br />
https://paperswithcode.com/paper/self-supervised-learning-of-pretext-invariant<br />
<br />
== References ==<br />
<br />
[1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.<br />
<br />
[2] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2015. <br />
<br />
[3] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint, 2017<br />
<br />
[4] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[5] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.<br />
<br />
[6] Jong-Chyi Su, Subhransu Maji, Bharath Hariharan. When does self-supervision improve few-shot learning? European Conference on Computer Vision, 2020.<br />
<br />
[7] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.<br />
<br />
[8] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_Representations&diff=49321Self-Supervised Learning of Pretext-Invariant Representations2020-12-06T02:06:29Z<p>N2chatti: /* Critiques */</p>
<hr />
<div>==Authors==<br />
<br />
Ishan Misra, Laurens van der Maaten<br />
<br />
== Presented by == <br />
Sina Farsangi<br />
<br />
== Introduction == <br />
<br />
Modern image recognition and object detection systems find image representations using a large number of data with pre-defined semantic annotations. Some examples of these annotations are class labels [1] and bonding boxes [2], as shown in Figure 1. There is a need for a large number of labeled data that is not the case in all scenarios for finding representations using pre-defined semantic annotations. Also, these systems usually learn specific features for a particular type of class and not necessarily semantically meaningful features that can help generalize to other domains and classes. '''In other words, pre-defined semantic annotations scale poorly to the long tail of visual concepts'''[3]. Therefore, there has been a big interest in the community to find image representations that are more visually meaningful and can help in several tasks such as image recognition and object detection. One of the fast-growing areas of research that tries to address this problem is '''Self-Supervised Learning'''. Self-Supervised Learning tries to learn deep models that find image representations from the pixels themselves rather than using pre-defined semantic annotated data. As we will show, in self-supervised learning, there is no need for using human-provided class labels or bounding boxes for classification and object detection tasks, respectively. <br />
<br />
[[File: SSL_1.JPG | 800px | center]]<br />
<div align="center">'''Figure 1:''' Semantic Annotations used for finding image representations: a) Class labels and b) Bounding Boxes </div><br />
<br />
Self-Supervised Learning is often done using a set of tasks called '''Pretext tasks'''. During these tasks, a transformation <math> \tau </math> is applied to unlabeled images <math> I </math> to obtain a set of transformed images, <math> I^{t} </math>. Then, a deep neural network, <math> \phi(\theta) </math>, is trained to predict the transformation characteristic. Several Pretext Tasks exist based on the type of used transformation. Two of the most used pretext tasks are rotations and jigsaw puzzle [4,5,6]. As shown in Figure 2, in the rotation task, unlabeled images, <math> </math> are rotated by random degrees (0,90,180,270) and the deep network learns to predict the rotation degree. In the jigsaw task, which is more complicated than the rotation prediction task, unlabeled images are cropped into 9 patches, then, the image is perturbed by randomly permuting the nine patches. Each permutation falls into one of the 35 classes according to a formula. A deep network is then trained to predict the class of the permutation of the patches in the perturbed image. Some other tasks include colorization, where the model tries to revert the colors of a colored image turned to greyscale, and image reconstruction, where a square chunk of the image is deleted and the model tries to reconstruct that part. <br />
<br />
[[File: SSL_2.JPG |1000px | center]]<br />
<div align="center">'''Figure 2:''' Self-Supervised Learning using Rotation and Jigsaw Pretext Tasks </div><br />
<br />
Although the proposed pretext tasks have achieved promising results, they have the disadvantage of being covariant to the applied transformation. In other words, as deep networks are trained to predict transformation characteristics, they will also learn representations that will vary based on the applied transformation. By intuition, we would like to obtain representations the are common between the original images and the transformed ones. This idea is supported by the fact that humans are able to recognize these transformed images. This hints us to try to develop a method that obtains image representations that are common between the original and transformed images, in other words, image representations that are transformation invariant. The paper tries to address this problem by introducing '''Pretext Invariant Representation Learning''' (PIRL) that learns to obtain Self-Supervised image representations that as opposed to Pretext Tasks are transformation invariant and therefore, more semantically meaningful. The performance of the proposed method is evaluated on several Self-Supervision learning benchmarks. The results show that the PIRL introduces a new state-of-the-art method in Self-Supervised Learning by learning transformation invariant representations.<br />
<br />
== Problem Formulation and Methodology ==<br />
<br />
[[File: SSL_3.JPG | 800px | center]]<br />
<div align="center">'''Figure 3:''' Overview of Standard Pretext Learning and Pretext-Invariant Representation Learning (PIRL). </div><br />
<br />
<br />
An overview of the proposed method and a comparison with Pretext Tasks are shown in Figure 3. For a given image ,<math>I</math>, in the Dataset of unlabeled images, <math> D=\{{I_1,I_2,...,I_{|D|}}\} </math>, a transformation <math> \tau </math> is applied: <br />
<br />
\begin{align} \tag{1} \label{eqn:1}<br />
I^t=\tau(I)<br />
\end{align}<br />
<br />
Where <math>I^t</math> is the transformed image. We would like to train a convolutional neural network, <math>\phi(\theta)</math>, that constructs image representations <math>v_{I}=\phi_{\theta}(I)</math>. Pretext Task based methods learn to predict transformation characteristics, <math>z(t)</math>, by minimizing a transformation covariant loss function in the form of:<br />
<br />
\begin{align} \tag{2} \label{eqn:2}<br />
l_{\text{cov}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,z(t)<br />
\end{align}<br />
<br />
As it can be seen, the loss function covaries with the applied transformation and therefore, the obtained representations may not be semantically meaningful. PIRL tries to solve for this problem as shown in Figure 3. The original and transformed images are passed through two parallel convolutional neural networks to obtain two set of representations, <math>v(I)</math> and <math>v(I^t)</math>. Then, a contrastive loss function is defined to ensure that the representations of the original and transformed images are similar to each other. The transformation invariant loss function can be defined as:<br />
<br />
\begin{align} \tag{3} \label{eqn:3}<br />
l_{\text{inv}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,v_{I^t})<br />
\end{align}<br />
<br />
Where L is a contrastive loss based on Noise Contrastive Estimators (NCE). The NCE function can be shown as below: <br />
<br />
\begin{align} \tag{4} \label{eqn:4}<br />
h(v_I,v_{I^t})=\frac{\exp \biggl( \frac{s(v_I,v_{I^t}}{\tau}) \biggr)}{\exp \biggl(\frac{s(v_I,v_{I^t}}{\tau} \biggr) + \sum_{I^{'} \in D_N}^{} \exp \biggl( \frac{s(v_{I^t},v_{I^{'}}}{\tau}) \biggr)}<br />
\end{align}<br />
<br />
where <math>s(.,.)</math> is the cosine similarity function and <math>\tau</math> is the temperature parameter that is usually set to 0.07. Also, a set of N images are chosen randomly from dataset where <math>I^{'}\neq I</math>. These images are used in the loss in order to ensure their representation dissimilarity with transformed image representations. Also, during model implementation, two heads (few additional deep layers) , <math>f</math> and <math>g</math>, are applied on top of <math>v(I)</math> and <math>v(I^t)</math>. Using the NCE formulation, the contrastive loss can be written as:<br />
<br />
\begin{align} \tag{5} \label{eqn:5}<br />
L_{\text{NCE}}(I,I^{t})=-\text{log}[h(f(v_I),g(v_{I^t}))]-\sum_{I^{'}\in D_N}^{} \text{log}[1-h(g(v_{I^t}),f(v_{I^{'}}))]<br />
\end{align}<br />
<br />
[[File: SSL_4.JPG | 800px | center]]<br />
<div align="center">'''Figure 4:''' Proposed PIRL </div><br />
<br />
Although the formulation looks complicated, the take out here is that by minimizing the NCE based loss function, the similarity between the original and transformed image representations, <math>v(I)</math> and <math>v(I^t)</math> , increases and at the same time the dissimilarity between <math>v(I^t)</math> and negative images representations, <math>v(I^{'})</math>, are increased. During training a memory bank [], <math>m_I</math>, of dataset image representations are used to access the representations of the dataset images including the negative images. The proposed PIRL model is shown in Figure (4). Finally, the contrastive loss in equation (5) does not take into account the dissimilarity between the original image representations, <math>v(I)</math>, and the negative image representations, <math>v(I^{'})</math>. By taking this into account and using the memory bank, the final contrastive loss function is obtained as:<br />
<br />
\begin{align} \tag{6} \label{eqn:6}<br />
L(I,I^{t})=\lambda L_{\text{NCE}}(m_I,g(v_{I^t})) + (1-\lambda)L_{\text{NCE}}(m_I,f(v_{I}))<br />
\end{align}<br />
Where <math>\lambda</math> is a hyperparameter that determines the weight of each of NCE losses. The default value for this parameter is 0.5. In the next section, experimental results are shown using the proposed PIRL model.<br />
<br />
==Experimental Results ==<br />
<br />
For the experiments in this section, PIRL is implemented using jigsaw transformations. The combination of PIRL with other types of transformations is shown in the last section of the summary. The quality of image representations obtained from PIRL Self-Supervised Learning is evaluated by comparing its performance to other Self-Supervised Learning methods on image recognition and object detection tasks. For the experiments, a ResNet50 model is trained using PIRL and other methods by using 1.28M randomly sampled images from the ImageNet dataset. Also, the number of negative images used for PIRL is N=32000. <br />
<br />
===Object Detection===<br />
<br />
A Faster R-CNN model with a ResNet-50 backbone, pre-trained using PIRL and other Self-Supervised methods, is employed for the object detection task. Then, the pre-trained model weights are used as initial weights for the Faster-RCNN model backbone during training on the VOC07+12 dataset. The result of object detection using PIRL is shown in Figure (5) and it is compared to other methods. It can be seen that PIRL not only outperforms other Self-Supervised-based methods, '''for the first time it outperforms Supervised Pre-training on object detection'''. <br />
<br />
[[File: SSL_5.PNG | 800px | center]]<br />
<div align="center">'''Figure 5:''' Object detection on VOC07+12 using Faster R-CNN and comparing the Average Precision (AP) of detected bounding boxes. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
===Image Classification with linear models===<br />
<br />
In the next experiment, the performance of the PIRL is evaluated on image classification using four different datasets. For this experiment, the pre-trained ResNet-50 model is utilized as an image feature extractor. Then, a linear classifier is trained on fixed image representations. The results are shown in Figure (6). The results demonstrate that while PIRL substantially outperforms other Self-Supervised Learning methods, it still falls behind Supervised Pre-trained Learning. <br />
<br />
[[File: SSL_6.PNG | 800px | center]]<br />
<div align="center">'''Figure 6:''' Image classification with linear models. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
Overall, the results show that PIRL the best performance among different Self-Supervised Learning methods. Even, it is able to perform better than the Supervised Learning Pretrained model on object detection. This is because PIRL learns representations that are invariant to the applied transformations which results in more semantically meaningful and richer visual features. In the next section, some analysis on PIRL is presented.<br />
<br />
==Analysis==<br />
<br />
===Does PIRL learn invariant representations?===<br />
<br />
In order to show that the image representations obtained using PIRL are invariant, several images are chosen from the ImageNet dataset, and representations of the chosen images and their transformed version are obtained using one-time PIRL and another time the jigsaw pretext task which is the transformation covariant version of PIRL. Then, for each method, the L2 norm between the original and transformed image representations are computed and their distributions are plotted in Figure (7). It can be seen that PIRL results in more similarity between the original and transformed image representations. Therefore, PIRL learns invariant representations. <br />
<br />
[[File: SSL_7.PNG | 800px | center]]<br />
<div align="center">'''Figure 7:''' Invariance of PIRL representations. </div><br />
<br />
===Which layer produces the best representation?===<br />
Figure 12 studies the quality of representations in earlier layers of the convolutional networks. The figure reveals that the quality of Jigsaw representations improves from the conv1 to the res4 layer but that their quality sharply decreases in the res5 layer. By contrast, PIRL representations are invariant to image transformations and the best image representations are extracted from the res5 layer of PIRL-trained networks.<br />
<br />
[[File: Paper29_SSL.PNG | 800px | center]]<br />
<div align="center">'''Figure 12:'''Quality of PIRL representations per layer. </div><br />
<br />
===What is the effect of <math>\lambda</math> in the PIRL loss function?===<br />
<br />
In order to investigate the effect of <math>\lambda</math> on PIRL representations, the authors obtained the accuracy of image recognition on ImageNet dataset using different values for <math>\lambda</math> in PIRL. As shown in Figure 8, the results show that the value of <math>\lambda</math> affects the performance of PIRL and the optimum value for <math>\lambda</math> is 0.5. <br />
<br />
[[File: SSL_8.PNG | 800px | center]]<br />
<div align="center">'''Figure 8:''' Effect of varying the parameter <math>\lambda</math> </div><br />
<br />
===What is the effect of the number of image transforms?===<br />
<br />
As another experiment, the authors investigated the number of image transforms and their effect on PIRL performance. There is a limitation on the number of transformations that can be applied using the jigsaw pretext method as this method has to predict the permutation of the patches and the number of the parameters in the classification layer grows linearly with the number of used transformations. However, PIRL is able to use all number of image transformations which is equal to <math>9! \approx 3.6\times 10^5</math>. Figure (9) shows the effect of changing the number of patch permutations on PIRL and jigsaw. The results show that increasing the number of permutations increases the mean Average Precision (mAP) of PIRL on image classification using the VOCC07 dataset. <br />
<br />
[[File: SSL_9.PNG | 800px | center]]<br />
<div align="center">'''Figure 9:''' Effect of varying the number of patch permutations </div><br />
<br />
===What is the effect of the number of negative samples?===<br />
<br />
In order to investigate the effect of negative samples number, N, on PIRL's performance, the image classification accuracy is obtained using ImageNet dataset for a variety of values for N. As it is shown in Figure (10), increasing the number of negative sample results in richer image representations and higher classification accuracy. <br />
<br />
[[File: SSL_10.PNG | 800px | center]]<br />
<div align="center">'''Figure 10:''' Effect of varying the number of negative samples </div><br />
<br />
==Generalizing PIRL to Other Pretext Tasks==<br />
<br />
The used PIRL model in this paper used jigsaw permutations as the applied transformation to the original image. However, PIRL is generalizable to other Pretext Tasks. To show this, first, PIRL is used with rotation transformations and the performance of rotation-based PIRL is compared to the covariant rotation Pretext Task. The results in Figure (11) show that using PIRL substantially increases the classification accuracy on four datasets in comparison with the rotation Pretext Task. Next, both jigsaw and rotation transformations are used with PIRL to obtain image representations. The results show that combining multiple transformations with PIRL can further improve the accuracy of the image classification task. <br />
<br />
[[File: SSL_11.PNG | 800px | center]]<br />
<div align="center">'''Figure 11:''' Using PIRL with (combinations of) different pretext tasks </div><br />
<br />
==Conclusion==<br />
<br />
In this paper, a new state-of-the-art Self-Supervised learning method, PIRL, was presented. The proposed model learns to obtain features that are common between the original and transformed images, resulting in a set of transformation invariant and more semantically meaningful features. This is done by defining a contrastive loss function between the original images, transformed images, and a set of negative images. The results show that PIRL image representation is richer than previously proposed methods, resulting in higher accuracy and precision on image classification and object detection tasks.<br />
<br />
==Critiques==<br />
<br />
The paper proposes a very nice method for obtaining transformation invariant image representations. However, the authors can extend their work with a richer set of transformations. Also, it would be a good idea to investigate the combination of PIRL with clustering-based methods [7,8]. One of the clustering-based methods is '''DeepCluster''' [7], where each previous version of its representation is used by bootstrapping to produce a target for the next representation. They built a new representation through clustering data points using the prior representation and then classify the target by using the clustered index of each sample. That may result in better image representations. This will avoid the use of negative pairs but it might also cause collapsing to trivial solutions which create a trade-off. <br />
<br />
It could be better if they could visualize their network weights and compare them to the other supervised methods for the deeper layers that extract high-level information.<br />
<br />
== Source Code ==<br />
<br />
https://paperswithcode.com/paper/self-supervised-learning-of-pretext-invariant<br />
<br />
== References ==<br />
<br />
[1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.<br />
<br />
[2] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2015. <br />
<br />
[3] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint, 2017<br />
<br />
[4] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[5] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.<br />
<br />
[6] Jong-Chyi Su, Subhransu Maji, Bharath Hariharan. When does self-supervision improve few-shot learning? European Conference on Computer Vision, 2020.<br />
<br />
[7] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.<br />
<br />
[8] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=49320stat940F212020-12-06T02:03:48Z<p>N2chatti: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Procession method to Improve Robustness And Uncertainity || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || [https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration Summary] ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization Summary]|| [https://youtu.be/b9MU5cc3-m0 Presentation Video]<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIRCOMPARISON OFGRAPHNEURALNETWORKSFORGRAPHCLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification Summary] || [https://drive.google.com/file/d/1Dx6mFL_zBAJcfPQdOWAuPn0_HkvTL_0z/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates Summary] || [[https://drive.google.com/file/d/1amkWrR8ZQKnnInjedRZ7jbXTqCA8Hy1r/view?usp=sharing Presentation ]]<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS Summary] || [https://drive.google.com/file/d/1mZVlF2UvJ2lGjuVcN5SYqBuO4jZjuCcU/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Object_Detection_with_Co-Attention_and_Co-Excitation Summary] || [https://drive.google.com/file/d/1OUx64_pTZzCQAdo_fmy_9h9NbuccTnn6/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE Summary] || [[https://youtu.be/5h-365TPQqE Presentation ]]<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations Summary] || Learn<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_Learning Summary] || [https://www.youtube.com/watch?v=WKUj30tgHfs&feature=youtu.be video]<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features Summary]|| [https://youtu.be/djrJG6pJaL0 video] also available on Learn<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION Summary] || Learn<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition#Source_code Summary] ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_Classification Summary] || [https://www.youtube.com/watch?v=jG57QgY71yU video]<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Dense Passage Retrieval for Open-Domain Question Answering || [https://arxiv.org/abs/2004.04906 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering Summary] || Learn<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes Summary]|| Learn<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| AdaCompress: Adaptive Compression for Online Computer Vision Services || [https://arxiv.org/pdf/1909.08148.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adacompress:_Adaptive_compression_for_online_computer_vision_services Summary] || [https://youtu.be/D54qsSkqryk video] or Learn<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||RoBERTa: A Robustly Optimized BERT Pretraining Approach ||[https://openreview.net/forum?id=SyxS0T4tvS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta Summary] || [https://youtu.be/JdfvvYbH-2s Presentation Video]<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT Summary] || Learn<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Time-series_Generative_Adversarial_Networks Summary] || [https://youtu.be/SENjFF4N45s video] or Learn<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||A critical analysis of self-supervision, or what we can learn from a single image|| [https://openreview.net/pdf?id=B1esx6EYvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION Summary]|| [https://youtu.be/HkkacHrvloE YouTube]<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| Self-Supervised Learning of Pretext-Invariant Representations || [https://openaccess.thecvf.com/content_CVPR_2020/papers/Misra_Self-Supervised_Learning_of_Pretext-Invariant_Representations_CVPR_2020_paper.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_Representations Summary] || [https://www.youtube.com/watch?v=IlIPHclzV5E&ab_channel=sinaebrahimifarsangi YouTube] or Learn<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval Summary]|| Learn<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49319This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T02:01:04Z<p>N2chatti: /* Source code */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN. Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.<br />
<br />
== Network Architecture ==<br />
The figure below represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
<br />
== Training Algorithm ==<br />
The training of the Network is divide into 3 stages: Starting with stochastic gradient descent (SGD) of the layers (other than the last one) then projection of prototype and finally the convex optimization. In the initial stage the model identifies the most significant patches for the classification task and distinguishes between the prototypes of the images' true classes and those that are from different classes. SGD is used to optimize the parameters from the convolution layers and the prototypes of the prototype layer while fixing the weights of the fully connected layer in order to make the network learn to decrease the predicted probability when a part of an image of a given class is similar to a prototype from a different class. As for the second stage the aim is to visualize and associate each prototype with the most similar training image patch using the following update for every prototype of a class k:<br />
<math> P_j = \underset{z\ in Z_j}{\operatorname{arg\,min}} \lVert{z -p_j}\rVert_2 \quad\textrm{where}\quad Z_j = \{z:z \in \quad\textrm{patches} (f(x_i)) \forall i \quad\textrm{s.t}\quad y_i=k \} </math><br />
During this stage, associating a patch of the training image x to its corresponding prototype p is done as a result of the activation. The patch of x that is selected is the one that p activates the most given the activation map of x by p.<br />
In the last training stage, the convex optimization is applied on the last layer while fixing parameters of previous layers, to improve accuracy by adding sparsity to the model. In other word it makes the model ignore the reasoning process of decision making of this kind: an image belongs to a given class because it is not have prototypes from another class.<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
=== Examples of reasoning process ===<br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg|1200px|center]]<br />
<div align="center">Figure 2 : Classifying an image of specific car model </div><br />
[[File:exp2.jpg|1200px|center]]<br />
<div align="center">Figure 3 : Predicting the specie of a bird </div><br />
<br />
== Results: ==<br />
The results obtained using ProtoPNet on bird images as well as the car models are compared to the baseline models as well as attention-based deep models that were trained on the same datasets that ProtoPNet was trained on. ProtoPNet accuracy results are very close and as good as the non-interpretable baselines as shown in the tables below. <br />
[[File:table1protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 4 : Accuracy comparison of ProtoPNet with baseline models and other deep models on bird species dataset </div><br />
<br />
[[File:table2protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 5 : Accuracy comparison of ProtoPNet with baseline models on car dataset </div><br />
<br />
Another experience of combining many protoPNet models shows an improvement of the accuracy while preserving the transparency of the decision making process.<br />
<br />
== Conclusion ==<br />
The aim of constructing the ProtopNet network was introduce the interpretability property to neural networks. As a matter of fact, the use of ProtopNet makes the process of classifying images clearer. It is able to dissect images to find prototypical parts. And the predictions of classes of an image are made based on a comparison of parts of this image and learned prototypes of each classes. One constraint of this Network can be the pre-determined number of prototypes as it is domain related and has to be set beforehand.<br />
<br />
== Source code ==<br />
The code for this paper is available at [https://github.com/cfchen-duke/ProtoPNet https://github.com/cfchen-duke/ProtoPNet]</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49318This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T01:59:26Z<p>N2chatti: </p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN. Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.<br />
<br />
== Network Architecture ==<br />
The figure below represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
<br />
== Training Algorithm ==<br />
The training of the Network is divide into 3 stages: Starting with stochastic gradient descent (SGD) of the layers (other than the last one) then projection of prototype and finally the convex optimization. In the initial stage the model identifies the most significant patches for the classification task and distinguishes between the prototypes of the images' true classes and those that are from different classes. SGD is used to optimize the parameters from the convolution layers and the prototypes of the prototype layer while fixing the weights of the fully connected layer in order to make the network learn to decrease the predicted probability when a part of an image of a given class is similar to a prototype from a different class. As for the second stage the aim is to visualize and associate each prototype with the most similar training image patch using the following update for every prototype of a class k:<br />
<math> P_j = \underset{z\ in Z_j}{\operatorname{arg\,min}} \lVert{z -p_j}\rVert_2 \quad\textrm{where}\quad Z_j = \{z:z \in \quad\textrm{patches} (f(x_i)) \forall i \quad\textrm{s.t}\quad y_i=k \} </math><br />
During this stage, associating a patch of the training image x to its corresponding prototype p is done as a result of the activation. The patch of x that is selected is the one that p activates the most given the activation map of x by p.<br />
In the last training stage, the convex optimization is applied on the last layer while fixing parameters of previous layers, to improve accuracy by adding sparsity to the model. In other word it makes the model ignore the reasoning process of decision making of this kind: an image belongs to a given class because it is not have prototypes from another class.<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
=== Examples of reasoning process ===<br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg|1200px|center]]<br />
<div align="center">Figure 2 : Classifying an image of specific car model </div><br />
[[File:exp2.jpg|1200px|center]]<br />
<div align="center">Figure 3 : Predicting the specie of a bird </div><br />
<br />
== Results: ==<br />
The results obtained using ProtoPNet on bird images as well as the car models are compared to the baseline models as well as attention-based deep models that were trained on the same datasets that ProtoPNet was trained on. ProtoPNet accuracy results are very close and as good as the non-interpretable baselines as shown in the tables below. <br />
[[File:table1protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 4 : Accuracy comparison of ProtoPNet with baseline models and other deep models on bird species dataset </div><br />
<br />
[[File:table2protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 5 : Accuracy comparison of ProtoPNet with baseline models on car dataset </div><br />
<br />
Another experience of combining many protoPNet models shows an improvement of the accuracy while preserving the transparency of the decision making process.<br />
<br />
== Conclusion ==<br />
The aim of constructing the ProtopNet network was introduce the interpretability property to neural networks. As a matter of fact, the use of ProtopNet makes the process of classifying images clearer. It is able to dissect images to find prototypical parts. And the predictions of classes of an image are made based on a comparison of parts of this image and learned prototypes of each classes. One constraint of this Network can be the pre-determined number of prototypes as it is domain related and has to be set beforehand.<br />
<br />
== Source code ==<br />
The code for this paper is available at [https://github.com/cfchen-duke/ProtoPNet]</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49317This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T01:55:34Z<p>N2chatti: </p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN. Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.<br />
<br />
== Network Architecture ==<br />
The figure below represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
<br />
== Training Algorithm ==<br />
The training of the Network is divide into 3 stages: Starting with stochastic gradient descent (SGD) of the layers (other than the last one) then projection of prototype and finally the convex optimization. In the initial stage the model identifies the most significant patches for the classification task and distinguishes between the prototypes of the images' true classes and those that are from different classes. SGD is used to optimize the parameters from the convolution layers and the prototypes of the prototype layer while fixing the weights of the fully connected layer in order to make the network learn to decrease the predicted probability when a part of an image of a given class is similar to a prototype from a different class. As for the second stage the aim is to visualize and associate each prototype with the most similar training image patch using the following update for every prototype of a class k:<br />
<math> P_j = \underset{z\ in Z_j}{\operatorname{arg\,min}} \lVert{z -p_j}\rVert_2 \quad\textrm{where}\quad Z_j = \{z:z \in \quad\textrm{patches} (f(x_i)) \forall i \quad\textrm{s.t}\quad y_i=k \} </math><br />
During this stage, associating a patch of the training image x to its corresponding prototype p is done as a result of the activation. The patch of x that is selected is the one that p activates the most given the activation map of x by p.<br />
In the last training stage, the convex optimization is applied on the last layer while fixing parameters of previous layers, to improve accuracy by adding sparsity to the model. In other word it makes the model ignore the reasoning process of decision making of this kind: an image belongs to a given class because it is not have prototypes from another class.<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
=== Examples of reasoning process ===<br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg|1200px|center]]<br />
<div align="center">Figure 2 : Classifying an image of specific car model </div><br />
[[File:exp2.jpg|1200px|center]]<br />
<div align="center">Figure 3 : Predicting the specie of a bird </div><br />
<br />
== Results: ==<br />
The results obtained using ProtoPNet on bird images as well as the car models are compared to the baseline models as well as attention-based deep models that were trained on the same datasets that ProtoPNet was trained on. ProtoPNet accuracy results are very close and as good as the non-interpretable baselines as shown in the tables below. <br />
[[File:table1protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 4 : Accuracy comparison of ProtoPNet with baseline models and other deep models on bird species dataset </div><br />
<br />
[[File:table2protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 5 : Accuracy comparison of ProtoPNet with baseline models on car dataset </div><br />
<br />
Another experience of combining many protoPNet models shows an improvement of the accuracy while preserving the transparency of the decision making process.<br />
<br />
== Conclusion ==<br />
The aim of constructing the ProtopNet network was introduce the interpretability property to neural networks. As a matter of fact, the use of ProtopNet makes the process of classifying images clearer. It is able to dissect images to find prototypical parts. And the predictions of classes of an image are made based on a comparison of parts of this image and learned prototypes of each classes. One constraint of this Network can be the pre-determined number of prototypes as it is domain related and has to be set beforehand.</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:table2protoPNet.jpg&diff=49316File:table2protoPNet.jpg2020-12-06T01:54:07Z<p>N2chatti: </p>
<hr />
<div></div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:table1protoPNet.jpg&diff=49315File:table1protoPNet.jpg2020-12-06T01:53:44Z<p>N2chatti: </p>
<hr />
<div></div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49314This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T01:52:41Z<p>N2chatti: /* Results: */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN. Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.<br />
<br />
== Network Architecture ==<br />
The figure below represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
<br />
== Training Algorithm ==<br />
The training of the Network is divide into 3 stages: Starting with stochastic gradient descent (SGD) of the layers (other than the last one) then projection of prototype and finally the convex optimization. In the initial stage the model identifies the most significant patches for the classification task and distinguishes between the prototypes of the images' true classes and those that are from different classes. SGD is used to optimize the parameters from the convolution layers and the prototypes of the prototype layer while fixing the weights of the fully connected layer in order to make the network learn to decrease the predicted probability when a part of an image of a given class is similar to a prototype from a different class. As for the second stage the aim is to visualize and associate each prototype with the most similar training image patch using the following update for every prototype of a class k:<br />
<math> P_j = \underset{z\ in Z_j}{\operatorname{arg\,min}} \lVert{z -p_j}\rVert_2 \quad\textrm{where}\quad Z_j = \{z:z \in \quad\textrm{patches} (f(x_i)) \forall i \quad\textrm{s.t}\quad y_i=k \} </math><br />
During this stage, associating a patch of the training image x to its corresponding prototype p is done as a result of the activation. The patch of x that is selected is the one that p activates the most given the activation map of x by p.<br />
In the last training stage, the convex optimization is applied on the last layer while fixing parameters of previous layers, to improve accuracy by adding sparsity to the model. In other word it makes the model ignore the reasoning process of decision making of this kind: an image belongs to a given class because it is not have prototypes from another class.<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
=== Examples of reasoning process ===<br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg|1200px|center]]<br />
<div align="center">Figure 2 : Classifying an image of specific car model </div><br />
[[File:exp2.jpg|1200px|center]]<br />
<div align="center">Figure 3 : Predicting the specie of a bird </div><br />
<br />
== Results: ==<br />
The results obtained using ProtoPNet on bird images as well as the car models are compared to the baseline models as well as attention-based deep models that were trained on the same datasets that ProtoPNet was trained on. ProtoPNet accuracy results are very close and as good as the non-interpretable baselines as shown in the tables below. <br />
[[File:table1protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 4 : Accuracy comparison of ProtoPNet with baseline models and other deep models on bird species dataset </div><br />
<br />
[[File:table2protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 5 : Accuracy comparison of ProtoPNet with baseline models on car dataset </div><br />
<br />
Another experience of combining many protoPNet models shows an improvement of the accuracy while preserving the transparency of the decision making process.</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49313This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T01:51:41Z<p>N2chatti: /* Results: */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN. Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.<br />
<br />
== Network Architecture ==<br />
The figure below represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
<br />
== Training Algorithm ==<br />
The training of the Network is divide into 3 stages: Starting with stochastic gradient descent (SGD) of the layers (other than the last one) then projection of prototype and finally the convex optimization. In the initial stage the model identifies the most significant patches for the classification task and distinguishes between the prototypes of the images' true classes and those that are from different classes. SGD is used to optimize the parameters from the convolution layers and the prototypes of the prototype layer while fixing the weights of the fully connected layer in order to make the network learn to decrease the predicted probability when a part of an image of a given class is similar to a prototype from a different class. As for the second stage the aim is to visualize and associate each prototype with the most similar training image patch using the following update for every prototype of a class k:<br />
<math> P_j = \underset{z\ in Z_j}{\operatorname{arg\,min}} \lVert{z -p_j}\rVert_2 \quad\textrm{where}\quad Z_j = \{z:z \in \quad\textrm{patches} (f(x_i)) \forall i \quad\textrm{s.t}\quad y_i=k \} </math><br />
During this stage, associating a patch of the training image x to its corresponding prototype p is done as a result of the activation. The patch of x that is selected is the one that p activates the most given the activation map of x by p.<br />
In the last training stage, the convex optimization is applied on the last layer while fixing parameters of previous layers, to improve accuracy by adding sparsity to the model. In other word it makes the model ignore the reasoning process of decision making of this kind: an image belongs to a given class because it is not have prototypes from another class.<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
=== Examples of reasoning process ===<br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg|1200px|center]]<br />
<div align="center">Figure 2 : Classifying an image of specific car model </div><br />
[[File:exp2.jpg|1200px|center]]<br />
<div align="center">Figure 3 : Predicting the specie of a bird </div><br />
<br />
== Results: ==<br />
The results obtained using ProtoPNet on bird images as well as the car models are compared to the baseline models as well as attention-based deep models that were trained on the same datasets that ProtoPNet was trained on. ProtoPNet accuracy results are very close and as good as the non-interpretable baselines as shown in the tables below. <br />
[[File:table1.jpg|800px|center]]<br />
<div align="center">Figure 4 : Accuracy comparison of ProtoPNet with baseline models and other deep models on bird species dataset </div><br />
<br />
[[File:table2.jpg|800px|center]]<br />
<div align="center">Figure 5 : Accuracy comparison of ProtoPNet with baseline models on car dataset </div><br />
<br />
Another experience of combining many protoPNet models shows an improvement of the accuracy while preserving the transparency of the decision making process.</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49312This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T01:50:56Z<p>N2chatti: </p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN. Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.<br />
<br />
== Network Architecture ==<br />
The figure below represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
<br />
== Training Algorithm ==<br />
The training of the Network is divide into 3 stages: Starting with stochastic gradient descent (SGD) of the layers (other than the last one) then projection of prototype and finally the convex optimization. In the initial stage the model identifies the most significant patches for the classification task and distinguishes between the prototypes of the images' true classes and those that are from different classes. SGD is used to optimize the parameters from the convolution layers and the prototypes of the prototype layer while fixing the weights of the fully connected layer in order to make the network learn to decrease the predicted probability when a part of an image of a given class is similar to a prototype from a different class. As for the second stage the aim is to visualize and associate each prototype with the most similar training image patch using the following update for every prototype of a class k:<br />
<math> P_j = \underset{z\ in Z_j}{\operatorname{arg\,min}} \lVert{z -p_j}\rVert_2 \quad\textrm{where}\quad Z_j = \{z:z \in \quad\textrm{patches} (f(x_i)) \forall i \quad\textrm{s.t}\quad y_i=k \} </math><br />
During this stage, associating a patch of the training image x to its corresponding prototype p is done as a result of the activation. The patch of x that is selected is the one that p activates the most given the activation map of x by p.<br />
In the last training stage, the convex optimization is applied on the last layer while fixing parameters of previous layers, to improve accuracy by adding sparsity to the model. In other word it makes the model ignore the reasoning process of decision making of this kind: an image belongs to a given class because it is not have prototypes from another class.<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
=== Examples of reasoning process ===<br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg|1200px|center]]<br />
<div align="center">Figure 2 : Classifying an image of specific car model </div><br />
[[File:exp2.jpg|1200px|center]]<br />
<div align="center">Figure 3 : Predicting the specie of a bird </div><br />
<br />
== Results: ==<br />
The results obtained using ProtoPNet on bird images as well as the car models are compared to the baseline models as well as attention-based deep models that were trained on the same datasets that ProtoPNet was trained on. ProtoPNet accuracy results are very close and as good as the non-interpretable baselines as shown in the tables below. <br />
[[File:table1.jpg|800px|center]]<br />
<div align="center">Figure 4 : Accuracy comparison of ProtoPNet with baseline models and other deep models on bird species dataset </div><br />
<br />
[[File:table2.jpg800px|center]]<br />
<div align="center">Figure 5 : Accuracy comparison of ProtoPNet with baseline models on car dataset </div><br />
<br />
Another experience of combining many protoPNet models shows an improvement of the accuracy while preserving the transparency of the decision making process.</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49311This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T01:43:07Z<p>N2chatti: /* Training Algorithm */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN. Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.<br />
<br />
== Network Architecture ==<br />
The figure below represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
<br />
== Training Algorithm ==<br />
The training of the Network is divide into 3 stages: Starting with stochastic gradient descent (SGD) of the layers (other than the last one) then projection of prototype and finally the convex optimization. In the initial stage the model identifies the most significant patches for the classification task and distinguishes between the prototypes of the images' true classes and those that are from different classes. SGD is used to optimize the parameters from the convolution layers and the prototypes of the prototype layer while fixing the weights of the fully connected layer in order to make the network learn to decrease the predicted probability when a part of an image of a given class is similar to a prototype from a different class. As for the second stage the aim is to visualize and associate each prototype with the most similar training image patch using the following update for every prototype of a class k:<br />
<math> P_j = \underset{z\ in Z_j}{\operatorname{arg\,min}} \lVert{z -p_j}\rVert_2 \quad\textrm{where}\quad Z_j = \{z:z \in \quad\textrm{patches} (f(x_i)) \forall i \quad\textrm{s.t}\quad y_i=k \} </math><br />
During this stage, associating a patch of the training image x to its corresponding prototype p is done as a result of the activation. The patch of x that is selected is the one that p activates the most given the activation map of x by p.<br />
In the last training stage, the convex optimization is applied on the last layer while fixing parameters of previous layers, to improve accuracy by adding sparsity to the model. In other word it makes the model ignore the reasoning process of decision making of this kind: an image belongs to a given class because it is not have prototypes from another class.<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
=== Examples of reasoning process ===<br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg|1200px|center]]<br />
<div align="center">Figure 2 : Classifying an image of specific car model </div><br />
[[File:exp2.jpg|1200px|center]]<br />
<div align="center">Figure 3 : Predicting the specie of a bird </div></div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49306This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T01:18:17Z<p>N2chatti: /* Training Algorithm */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN. Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.<br />
<br />
== Network Architecture ==<br />
The figure below represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
<br />
== Training Algorithm ==<br />
The training of the Network is divide into 3 stages: Starting with stochastic gradient descent (SGD) of the layers (other than the last one) then projection of prototype and finally the convex optimization. In the initial stage the model identifies the most significant patches for the classification task and distinguishes between the prototypes of the images' true classes and those that are from different classes. SGD is used to optimize the parameters from the convolution layers and the prototypes of the prototype layer while fixing the weights of the fully connected layer in order to make the network learn to decrease the predicted probability when a part of an image of a given class is similar to a prototype from a different class. As for the second stage the aim is to visualize and associate each prototype with the most similar training image patch using the following update for every prototype of a class k.<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
=== Examples of reasoning process ===<br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg|1200px|center]]<br />
<div align="center">Figure 2 : Classifying an image of specific car model </div><br />
[[File:exp2.jpg|1200px|center]]<br />
<div align="center">Figure 3 : Predicting the specie of a bird </div></div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49305This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T01:13:33Z<p>N2chatti: /* Examples of reasoning process */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN. Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.<br />
<br />
== Network Architecture ==<br />
The figure below represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
<br />
== Training Algorithm ==<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
=== Examples of reasoning process ===<br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg|1200px|center]]<br />
<div align="center">Figure 2 : Classifying an image of specific car model </div><br />
[[File:exp2.jpg|1200px|center]]<br />
<div align="center">Figure 3 : Predicting the specie of a bird </div></div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49303This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T01:09:34Z<p>N2chatti: /* Network Architecture */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN. Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.<br />
<br />
== Network Architecture ==<br />
The figure below represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
<br />
== Training Algorithm ==<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
=== Examples of reasoning process ===<br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg]]<br />
[[File:exp2.jpg]]</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:exp2.jpg&diff=49302File:exp2.jpg2020-12-06T01:08:31Z<p>N2chatti: </p>
<hr />
<div></div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:exp1.jpg&diff=49301File:exp1.jpg2020-12-06T01:08:09Z<p>N2chatti: </p>
<hr />
<div></div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49300This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T01:07:51Z<p>N2chatti: /* Examples of reasoning process */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN. Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.<br />
<br />
== Network Architecture ==<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
The above figure represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
<br />
== Training Algorithm ==<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
=== Examples of reasoning process ===<br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg]]<br />
[[File:exp2.jpg]]</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49299This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T01:06:36Z<p>N2chatti: </p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN. Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.<br />
<br />
== Network Architecture ==<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
The above figure represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.<br />
<br />
== Training Algorithm ==<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.<br />
<br />
=== Examples of reasoning process ===</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49298This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T01:03:12Z<p>N2chatti: /* Network Architecture */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN. Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.<br />
<br />
== Network Architecture ==<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
The above figure represents the ProtoPNet architecture. The first layers of propPNet consist of commonly used convolutional layers f. (their parameters are denoted wconv). The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pretrained on ImageNet.They are also followed by two additional 1 × 1 convolutional layers. A layer called prototype gp, a fully connected layer h with weight wh and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as activation function. This network takes in an image x propagates it through the convolutional layers (f of shape H x W x D) where features are extracted and learns the porotypes P of shape (1 x 1 x D). the number of prototypes mk is pre-defined for each class k (10 per class in this study)<br />
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output z = f(x), the prototype unit gpj in the prototype layer gp computes the squared L2 distances between the j-th prototype pj and all patches of z that have the same shape as pj and returns similarity scores.<br />
These scores values indicates the presence of the prototypical part In the image all while preserving the spatial relation of z. It is possible to upsample it to the original size in order to obtain a heat map with the different part that are most similar to the compared prototypes. The scores given by each unit are produced using max pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix wh in h to produce the output logits as shown in the figure.</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49297This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T00:55:15Z<p>N2chatti: /* Previous Work */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze a performance of a trained CNN. Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determines parts of the input they are looking at but without associating them to prototypes.<br />
<br />
== Network Architecture ==<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div></div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49292This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T00:40:15Z<p>N2chatti: /* Network Architecture */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
<br />
== Network Architecture ==<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div></div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49291This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T00:33:41Z<p>N2chatti: /* Network Architecture */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
<br />
== Network Architecture ==<br />
[[File:netarch.jpg|center|1200px|Figure 1: Prototypical part Network Architecture]]</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49290This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T00:31:20Z<p>N2chatti: </p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
<br />
== Network Architecture ==<br />
[[File:netarch.jpg|center|1200px]]</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49289This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T00:31:07Z<p>N2chatti: </p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. <br />
<br />
== Previous Work ==<br />
<br />
== Network Architecture ==<br />
[[File:netarch.jpg|center|600px]]</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:netarch.jpg&diff=49288File:netarch.jpg2020-12-06T00:23:10Z<p>N2chatti: </p>
<hr />
<div></div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_Recognition&diff=49287This Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T00:21:53Z<p>N2chatti: Created page with "== Presented by == Nouha Chatti == Introduction == The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanl..."</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way dealing with classification tasks. <br />
The idea is to perform these tasks to process images by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression this looks like that. interpretable = reasoning process when making predictions. In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. <br />
There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans.<br />
<br />
<br />
== Previous Work ==<br />
<br />
== Network Architecture ==<br />
[[File:netarch.jpg]]</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval&diff=49283Pre-Training Tasks For Embedding-Based Large-Scale Retrieval2020-12-06T00:10:49Z<p>N2chatti: </p>
<hr />
<div><br />
==Contributors==<br />
Pierre McWhannel wrote this summary with editorial and technical contributions from STAT 946 fall 2020 classmates. The summary is based on the paper "Pre-Training Tasks for Embedding-Based Large-Scale Retrieval" which was presented at ICLR 2020. The authors of this paper are Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar [1].<br />
<br />
==Introduction==<br />
The problem domain in which the paper is positioned is large-scale query-document retrieval. This problem is: given a query/question to collect relevant documents which contain the answer(s) and identify the span of words which contain the answer(s). This problem is often separated into two steps 1) a retrieval phase which reduces a large corpus to a much smaller subset and 2) a reader which is applied to read the shortened list of documents and score them based on their ability to answer the query and identify the span containing the answer, these spans can be ranked according to some scoring function. In the setting of open-domain question answering the query, <math>q</math> is a question and the documents <math>d</math> represent a passage which may contain the answer(s). The focus of this paper is on the retriever question answering (QA), in this setting a scoring function <math>f: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R} </math> is utilized to map <math>(q,d) \in \mathcal{X} \times \mathcal{Y}</math> to a real value which represents the score of how relevant the passage is to the query. We desire high scores for relevant passages and low scores otherwise and this is the job of the retriever. <br />
<br />
<br />
Certain characteristics desired of the retriever are high recall since the reader will only be applied to a subset meaning there is an opportunity to identify false positives but not false negatives. The other characteristic desired of a retriever is low latency meaning it is computationally efficient. The popular approach to meet these heuristics has been the BM-25 [2] which relies on token-based matching between two high-dimensional and sparse vectors representing <math>q</math> and <math>d</math>. This approach performs well and retrieval can be performed in time sublinear to the number of passages with inverter indexing. The downfall of this type of algorithm is its inability to be optimized for a specific task. An alternate option is BERT [3] or transformer-based models [4] with cross attention between query and passage pairs which can be optimized for a specific task. However, these models suffer from latency as the retriever needs to process all the pairwise combinations of a query with all passages. The next type of algorithm is an embedding-based model that jointly embeds the query and passage in the same embedding space and then uses measures such as cosine distance, inner product, or even the Euclidean distance in this space to get a score. The authors suggest the two-tower models can capture deeper semantic relationships in addition to being able to be optimized for specific tasks. This model is referred to as the "two-tower retrieval model", since it has two transformer-based models where one embeds the query <math>\phi{(\cdot)}</math> and the other the passage <math>\psi{(\cdot)}</math>, then the embeddings can be scored by <math>f(q,d) = \langle \phi{(q)},\psi{(d)} \rangle \in \mathbb{R} </math>. This model can avoid the latency problems of a cross-layer model by pre-computing <math>\psi{(d)}</math> beforehand and then by utilizing efficient approximate nearest neighbor search algorithms in the embedding space to find the nearest documents.<br />
<br />
<br />
These two-tower retrieval models often use BERT with pre-trained weights, in BERT the pre-training tasks are masked-LM (MLM) and next sentence prediction (NSP). The authors note that the research of pre-training tasks that improve the performance of two-tower retrieval models is an unsolved research problem. This is exactly what the authors have done in this paper. That is, to develop pre-training tasks for the two-tower model based on heuristics and then validated by experimental results. The pre-training tasks suggested are Inverse Cloze Task (ICT), Body First Search (BFS), Wiki Link Prediction (WLP), and combining all three. The authors used the Retrieval Question-Answering ('''ReQA''') benchmark [5] and used two datasets '''SQuAD'''[6] and '''Natural Questions'''[7] for training and evaluating their models.<br />
<br />
<br />
<br />
===Contributions of this paper===<br />
*The two-tower transformer model with proper pre-training can significantly outperform the widely used BM-25 Algorithm.<br />
*Paragraph-level pre-training tasks such as ICT, BFS, and WLP hugely improve the retrieval quality, whereas the most widely used pre-training task (the token-level masked-LM) gives only marginal gains.<br />
*The two-tower models with deep transformer encoders benefit more from paragraph-level pre-training compared to its shallow bag-of-word counterpart (BoW-MLP)<br />
<br />
==Background on Two-Tower Retrieval Models==<br />
This section is to present the two-tower model in more detail. As seen previously <math>q \in \mathcal{X}</math>, <math> d \in \mathcal{y}</math> are the query and document or passage respectively. The two-tower retrieval model consists of two encoders: <math>\phi: \mathcal{X} \rightarrow \mathbb{R}^k</math> and <math>\psi: \mathcal{Y} \rightarrow \mathbb{R}^k</math>. The tokens in <math>\mathcal{X},\mathcal{Y}</math> are a sequence of tokens representing the query and document respectively. The scoring function is <math> f(q,d) = \langle \phi{(q)},\psi{(d)} \rangle \in \mathbb{R} </math>. The cross-attention BERT-style model would have <math> f_{\theta,w}(q,d) = \psi_{\theta}{(q \oplus d)^{T}}w </math> as a scoring function and <math> \oplus </math> signifies the concatenation of the sequences of <math>q</math> and <math>d</math>, <math> \theta </math> are the parameters of the cross-attention model. The architectures of these two models can be seen below in figure 1.<br />
<br />
<br/><br />
[[File:two-tower-vs-cross-layer.PNG|center|Figure 1: end-to-end model]]<br />
<br/><br />
<br />
===Comparisons===<br />
<br />
====Inference====<br />
In terms of inference, two-tower architecture has a distinct advantage as being easily seen from the architecture. Since the query and document are embedded independently, <math> \psi{(d)}</math> can be pre-computed. This is an important advantage as it makes the two-tower method feasible. As an example, the authors state the inner product between 100 query embeddings and 1 million document embeddings only takes hundreds of milliseconds on CPUs, however, for the equivalent calculation for a cross-attention-model it takes hours on GPUs. In addition, they remark the retrieval of the relevant documents can be done in sublinear time with respect <math> |\mathcal{Y}| </math> using a maximum inner product (MIPS) algorithm with little loss in the recall.<br />
<br />
====Learning====<br />
In terms of learning the two-tower model compared to BM-25 has a unique advantage of being possible to be fine-tuned for downstream tasks. This problem is within the paradigm of metric learning as the scoring function can be passed as input to an objective function of the log likelihood <math> \max_{\theta} \space log(p_{\theta}(d|q)) </math>. The conditional probability is represented by a SoftMax and then can be rewritten as:<br />
<br />
<br />
<math> p_{\theta}(d|q) = \frac{exp(f_{\theta}(q,d))}{\sum_{d' \in D} exp(f_{\theta}(q,d'))} \tag{1} \label{eq:sampled_SoftMax}</math><br />
<br />
<br />
Here <math> \mathcal{D} </math> is the set of all possible documents, the denominator is a summation over <math>\mathcal{D} </math>. As is customary they utilized a sampled SoftMax technique to approximate the full-SoftMax. The technique they have employed is by using a small subset of documents in the current training batch, while also using a proper correcting term to ensure the unbiasedness of the partition function.<br />
<br />
====Why do we need pre-training then?====<br />
Although the two-tower model can be fine-tuned for a downstream task getting sufficiently labeled data is not always possible. This is an opportunity to look for better pre-training tasks as the authors have done. The authors suggest getting a set of pre-training task <math> \mathcal{T} </math> with labeled positive pairs of queries of documents/passages and then afterward fine-tuning on the limited amount of data for the downstream task.<br />
<br />
==Pre-Training Tasks==<br />
The authors suggest three pre-training tasks for the encoders of the two-tower retriever, these are ICT, BFS, and WLP. Keep in mind that ICT [8] is not a newly proposed task. These pre-training tasks are developed in accordance with two heuristics they state and our paragraph-level pre-training tasks in contrast to more common sentence-level or word-level tasks:<br />
<br />
# The tasks should capture different resolutions of semantics between query and document. The semantics can be the local context within a paragraph, global consistency within a document, and even semantic relation between two documents.<br />
# Collecting the data for the pre-training tasks should require as little human intervention as possible and be cost-efficient.<br />
<br />
All three of the following tasks can be constructed in a cost-efficient manner and with minimal human intervention by utilizing Wikipedia articles. In figure 2 below the dashed line surrounding the block of text is to be regarded as the document <math> d </math> and the rest of the circled portions of text make up the three queries <math>q_1, q_2, q_3 </math> selected for each task.<br />
<br />
<br/><br />
[[File:pre-train-tasks.PNG|center|Figure 1: end-to-end model]]<br />
<br/><br />
<br />
===Inverse Cloze Task (ICT)===<br />
This task is given a passage <math> p </math> consisting of <math> n </math> sentences <math>s_{i}, i=1,..,n </math>. <math> s_{i} </math> is randomly sampled and used as the query <math> q </math> and then the document is constructed to be based on the remaining <math> n-1 </math> sentences. This is utilized to meet the local context within a paragraph of the first heuristic. In figure 2 <math> q_1 </math> is an example of a potential query for ICT based on the selected document <math> d </math>.<br />
<br />
===Body First Search (BFS)===<br />
BFS is <math> (q,d) </math> pairs are created by randomly selecting a sentence <math> q_2 </math> and setting it as the query from the first section of a Wikipedia page which contains the selected document <math> d </math>. This is utilized to meet the global context consistency within a document as the first section in Wikipedia generally is an overview of the whole page and they anticipate it to contain information central to the topic. In figure 2 <math> q_2 </math> is an example of a potential query for BFS based on the selected document <math> d </math>.<br />
<br />
===Wiki Link Prediction (WLP)===<br />
The third task is selecting a hyperlink that is within the selected document <math> d </math>, as can be observed in figure 2 "machine learning" is a valid option. The query <math> q_3 </math> will then be a random sentence on the first section of the page where the hyperlinks redirects to. This is utilized to provide the semantic relation between documents.<br />
<br />
<br />
Additionally, Masked LM (MLM) is considered during the experimentations which again is one of the primary pre-training tasks used in BERT.<br />
<br />
==Experiments and Results==<br />
<br />
===Training Specifications===<br />
<br />
*Each tower is constructed from the 12 layers BERT-base model.<br />
*Embedding is achieved by applying a linear layer on the [CLS] token output to get an embedding dimension of 512.<br />
*Sequence lengths of 64 and 288 respectively for the query and document encoders.<br />
*Pre-trained on 32 TPU v3 chips for 100k step with Adam optimizer learning rate of <math> 1 \times 10^{-4} </math> with 0.1 warm-up ratio, followed by linear learning rate decay and batch size of 8192. (~2.5 days to complete)<br />
*Fine tuning on downstream task <math> 5 \times 10^{-5} </math> with 2000 training steps and batch size 512.<br />
*Tokenizer is WordPiece.<br />
<br />
===Pre-Training Tasks Setup===<br />
<br />
*MLM, ICT, BFS, and WLP are used.<br />
*Various combinations of the aforementioned tasks are tested.<br />
*Tasks define <math> (q,d) </math> pairings.<br />
*ICT's document <math> d </math> has the article title and passage separated by a [SEP] symbol as input to the document encoder.<br />
<br />
Below table 1 shows some statistics for the datasets constructed for the ICT, BFS, and WLP tasks. Token counts considered after the tokenizer is applied.<br />
<br />
[[File: wiki-pre-train-statistics.PNG | center]]<br />
<br />
===Downstream Tasks Setup===<br />
Retrieval Question-Answering ('''ReQA''') benchmark used.<br />
<br />
*Datasets: '''SQuAD''' and '''Natural Questions'''.<br />
*Each entry of data from datasets is <math> (q,a,p) </math>, where each element is, query <math> q </math>, answer <math> a </math>, and passage <math> p </math> containing the answer, respectively.<br />
*Authors split the passage to sentences <math>s_i</math> where <math> i=1,...n </math> and <math> n </math> is the number of sentences in the passage.<br />
*The problem is then recast for a given query to retrieve the correct sentence-passage <math> (s,p) </math> from all candidates for all passages split in this fashion.<br />
*They remark their reformulation makes it a more difficult problem.<br />
<br />
Note: Experiments done on '''ReQA''' benchmarks are not entirely open-domain QA retrieval as the candidate <math> (s,p) </math> only cover the training set of QA dataset instead of entire Wikipedia articles. There are some experiments they did with augmented the dataset as will be seen in table 6.<br />
<br />
Below table 2 shows statistics of the datasets used for the downstream QA task.<br />
<br />
<br />
[[File:data-statistics-30.PNG | center]]<br />
<br />
===Evaluation===<br />
<br />
*3 train/test splits: 1%/99%, 5%/95%, and 80%/20%.<br />
*10% of the training set is used as a validation set for hyper-parameter tuning.<br />
*Training never has seen the test queries in the test pairs of <math>(q,d).</math><br />
*Evaluation metric is recall@k since the goal is to capture positives in the top-k results.<br />
<br />
<br />
===Results===<br />
<br />
Table 3 shows the results on the '''SQuAD''' dataset, as observed the performance using the combination of all 3 pre-training tasks has the best performance, with the exception of R@1 for 1%/99% train/test split. BoW-MLP is used to justify the use of a transformer as the encoder since it has more complexity, BoW-MLP here looks up uni-grams from an embedding table, aggregates the embeddings with average pooling, and passes them through a shallow two-layer ML network with <math> tanh </math> activation to generate the final 512-dimensional query/document embeddings.<br />
<br />
[[File:table3-30.PNG | center]]<br />
<br />
Table 4 shows the results on the '''Natural Questions''' dataset, as observed the performance using the combination of all 3 pre-training tasks has the best performance in all testing scenarios.<br />
<br />
[[File:table4-30.PNG | center]]<br />
<br />
====Ablation Study====<br />
The embedding-dimensions, # of layers in BERT, and pre-training tasks are altered for this ablation study as seen in table 5.<br />
<br />
Interestingly ICT+BFS+WLP is only an absolute of 1.5% better than just ICT in the low-data regime. The authors infer this could be representative of when there is insufficient downstream data for training that more global pre-training tasks become increasingly beneficial since BFS and WLP offer this.<br />
<br />
[[File:table5-30.PNG | center]]<br />
<br />
====Evaluation of Open-Domain Retrieval====<br />
Augment ReQA benchmark with large-scale (sentence, evidence passage) pairs extracted from general Wikipedia. This is added as one million external candidate pairs into the existing retrieval candidate set of the ReQA benchmark. <br />
<br />
[[File:table6-30.PNG | center]]<br />
<br />
The 3 talking points about the above results are as follows <br><br />
• The two-tower Transformer models pretrained with ICT+BFS+WLP and ICT substantially outperform the BM-25 baseline. <br><br />
• ICT+BFS+WLP pre-training method consistently improves the ICT pre-training method <br><br />
• Results here are fairly consistent with previous results.<br />
<br />
==Conclusion==<br />
<br />
The authors performed a comprehensive study and have concluded that properly designing paragraph-level pre-training tasks including ICT, BFS, and WLP for a two-tower transformer model can significantly outperform a BM-25 algorithm, which is an unsupervised method using weighted token-matching as a scoring function. The findings of this paper suggest that a two-tower transformer model with proper pre-training tasks can replace the BM25 algorithm used in the retrieval stage to further improve end-to-end system performance. The authors plan to test the pre-training tasks with a larger variety of encoder architectures and trying other corpora than Wikipedia and additionally trying pre-training in contrast to different regularization methods.<br />
<br />
==Critique==<br />
<br />
The authors of this paper suggest from their experiments that the combination of the 3 pretraining tasks ICT, BFS, and WLP yields the best results in general. However, they do acknowledge why two of the three pre-training tasks may not produce equally significant results as the three. From the ablation study and the small margin (1.5%) in which the three tasks achieved outperformed only using ICT in combination with observing ICT outperforming the three as seen in table 6 in certain scenarios. It seems plausible that a subset of two tasks could also potentially outperform the three or reach a similar performance. I believe this should be addressed in order to conclude three pre-training tasks are needed over two, though they never address this comparison I think it would make the paper more complete.<br />
<br />
The two-tower transformer model has two encoders, one for query encoding, and one for passage encoding. But in theory, queries and passages are all textual information. Do we really need two different encoders? Maybe using one encoder to encode both queries and passages can achieve competitive results. I'd like to see if any paper has ever covered this.<br />
<br />
As this paper full under the family of approaches that involves the dual encoder architecture where the query and document are encoded independently of each other and efficient retrieval is achieved using approximate nearest neighbor search as they have used and they have just shown that such dilemma can be resolved with matching-oriented pretraining that imitate the matching between the question and paragraph in open-domain QA. However, the existing pretraining strategies could be highly sample-inefficient and typically require a very large batch size (up to thousands) such that diverse and effective negative question-paragraph pairs could be included in each batch. As mention in Equation \eqref{eq:sampled_SoftMax}, they utilized a sampled SoftMax technique to approximate the full-SoftMax, which has the main advantage which saves a lot of memory during training instead of load the whole dataset to the memory. Another thing that might be interesting to implement is to use dense representation for the 3 pretraining tasks ICT, BFS, and WLP as mention in this paper [9], which also used a different way of in-batch sampling during the training of the encoders and last and not least, instead of using the ICT in training, I would like to compare it with synthetic queries by considering two decoding techniques, which are beam search and nucleus sampling illustrated in this paper [10].<br />
<br />
== References ==<br />
[1] Wei-Cheng Chang, Felix X Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. Pre-training tasks for embedding-based large-scale retrieval. arXiv preprint arXiv:2002.03932, 2020.<br />
<br />
[2] Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.<br />
<br />
[3] Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, 2019.<br />
<br />
[4] Ashish Vaswani and Noam Shazeer, et al. Attention Is All You Need. arXiv:1706.03762, 2017.<br />
<br />
[5] Ahmad, Amin and Constant, Noah and Yang, Yinfei and Cer, Daniel. ReQA: An Evaluation for End-to-End Answer Retrieval Models. arXiv:1907.04780, 2019.<br />
<br />
[6] Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv:1606.05250, 2016.<br />
<br />
[7] Google. NQ: Natural Questions. https://ai.google.com/research/NaturalQuestions<br />
<br />
[8] Kenton Lee and Ming-Wei Chang and Kristina Toutanova. Latent Retrieval for Weakly Supervised Open Domain Question Answering. arXiv:1906.00300, 2019.<br />
<br />
[9] Karpukhin, V., O˘guz, B., Min, S., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense passage retrieval for open-domain question answering. CoRR (04 2020)<br />
<br />
[10] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. International Conference on Learning Representations (ICLR). (2020)</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=42529stat940F212020-10-06T16:25:37Z<p>N2chatti: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || || 1|| || || ||<br />
|-<br />
|Week of Nov 2 || || 2|| || || ||<br />
|-<br />
|Week of Nov 2 || || 3|| || || ||<br />
|-<br />
|Week of Nov 2 || || 4|| || || ||<br />
|-<br />
|Week of Nov 2 || || 5|| || || ||<br />
|-<br />
|Week of Nov 2 || || 6|| || || ||<br />
|-<br />
|Week of Nov 9 || || 7|| || || ||<br />
|-<br />
|Week of Nov 9 || || 8|| || || ||<br />
|-<br />
|Week of Nov 9 || || 9|| || || ||<br />
|-<br />
|Week of Nov 9 || || 10|| || || ||<br />
|-<br />
|Week of Nov 9 || || 11|| || || ||<br />
|-<br />
|Week of Nov 9 || || 12|| || || ||<br />
|-<br />
|Week of Nov 16 || || 13|| || || ||<br />
|-<br />
|Week of Nov 16 || || 14|| || || ||<br />
|-<br />
|Week of Nov 16 || || 15|| || || ||<br />
|-<br />
|Week of Nov 16 || || 16|| || || ||<br />
|-<br />
|Week of Nov 16 || || 17|| || || ||<br />
|-<br />
|Week of Nov 16 || || 18|| || || ||<br />
|-<br />
|Week of Nov 23 || || 19|| || || ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || || 23|| || || ||<br />
|-<br />
|Week of Nov 23 || || 24|| || || ||<br />
|-<br />
|Week of Nov 30 || || 25|| || || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || || 29|| || || ||<br />
|-<br />
|Week of Nov 30 || || 30|| || || ||<br />
|-</div>N2chatti