F21-STAT 940-Proposal (revision of 2020-12-11T15:28:02Z)<p>A227jain: updating image name</p>
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
<br />
<br />
<br />
Project # 1 Group members:<br />
<br />
McWhannel, Pierre<br />
<br />
Yan, Nicole<br />
<br />
Hussein Salamah, Ahmed <br />
<br />
Title: Dense Retrieval for Conversational Information Seeking <br />
<br />
Description:<br />
One of the recognized problems in Information Retrieval (IR) is conversational search, which has attracted much attention in the form of conversational assistants such as Alexa, Siri, and Cortana. Meeting users' information needs is the ultimate goal of conversational search systems; in this setting, questions are asked sequentially, imposing the multi-turn format of the Conversational Information Seeking (CIS) task. The TREC Conversational Assistance Track (CAsT) [3] is a multi-turn conversational search task that provides a large-scale reusable test collection for sequences of conversational queries. The response of the conversational model is not a list of relevant documents; it is limited to brief response passages of 1 to 3 sentences.<br />
<br />
[[File:Screen Shot 2020-10-09 at 1.33.00 PM.png | 300px | Example Queries in CAsT]]<br />
<br />
In [4], the authors focus on improving open-domain question answering by using dense representations for retrieval instead of the traditional methods. They adopt a simple dual-encoder framework to construct a learnable retriever over large collections. We want to adopt this dense representation for the conversational model in the CAsT task and compare its performance with that of other approaches in the literature. Performance will be measured using graded relevance on a five-point scale: Fails to meet, Slightly meets, Moderately meets, Highly meets, and Fully meets.<br />
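<br />
As a rough illustration of the dual-encoder idea in [4], the sketch below ranks passages by the dot product between a query vector and passage vectors. The three-dimensional vectors are made-up stand-ins for encoder outputs, and `rank_passages` is a hypothetical helper, not part of any DPR release.<br />

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(query_vec, passage_vecs, k=2):
    """Return the ids of the k passages with the highest dot-product score."""
    scored = sorted(
        passage_vecs.items(),
        key=lambda item: dot(query_vec, item[1]),
        reverse=True,
    )
    return [pid for pid, _ in scored[:k]]


# Toy embeddings (assumed values; a real system would use BERT encoders).
passages = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.1],
    "p3": [0.7, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]
top = rank_passages(query, passages, k=2)
```

In a full system the passage vectors would be encoded once and stored in an index, so only the query has to be encoded at search time.<br />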
<br />
We aim to further improve our system performance by integrating the following techniques:<br />
<br />
• Paragraph-level pre-training tasks: ICT, BFS, and WLP [1]<br />
<br />
• ANCE training: periodically using checkpoints to encode documents, from which the strong negatives close to the relevant document would be used as next training negatives [5]<br />
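<br />
The ANCE-style negative selection described above can be sketched as follows: score every document with the current checkpoint's embeddings, drop the known relevant ones, and keep the highest-scoring remainder as hard negatives for the next training round. The embeddings and the helper name `mine_hard_negatives` are illustrative assumptions; [5] retrieves with an approximate nearest neighbour index rather than the exhaustive scan used here.<br />

```python
def mine_hard_negatives(query_vec, doc_vecs, relevant_ids, n_neg=2):
    """Pick the top-scoring non-relevant documents as hard negatives."""
    def score(vec):
        return sum(q * d for q, d in zip(query_vec, vec))

    candidates = [
        (score(vec), doc_id)
        for doc_id, vec in doc_vecs.items()
        if doc_id not in relevant_ids
    ]
    # Highest scores first: these are the negatives the model confuses most.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in candidates[:n_neg]]


# Toy document embeddings from a hypothetical checkpoint.
docs = {
    "d1": [0.9, 0.1],   # relevant
    "d2": [0.8, 0.3],   # close to the query but not relevant
    "d3": [0.1, 0.9],
    "d4": [0.5, 0.5],
}
negatives = mine_hard_negatives([1.0, 0.0], docs, relevant_ids={"d1"})
```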
<br />
In summary, this project is exploratory in nature: we will apply state-of-the-art Dense Passage Retrieval (DPR) techniques based on BERT [4, 6] to a question answering (QA) problem. Current first-stage retrieval approaches mainly rely on bag-of-words models. In this project, we hope to explore the feasibility of using state-of-the-art methods such as BERT instead. We will first compare how these perform on the TREC CAsT datasets [3] against the results retrieved using BM25. After these first points of comparison, we will explore one or more techniques designed to improve the performance of DPR [1, 5].<br />
<br />
References<br />
<br />
[1] Wei-Cheng Chang et al. Pre-training Tasks for Embedding-based Large-scale Retrieval. 2020. arXiv: 2002.03932 [cs.LG].<br />
<br />
[2] Zhuyun Dai and Jamie Callan. Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval. 2019. arXiv: 1910.10687 [cs.IR].<br />
<br />
[3] Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. TREC CAsT 2019: The Conversational Assistance Track Overview. 2020. arXiv: 2003.13624 [cs.IR].<br />
<br />
[4] Vladimir Karpukhin et al. Dense Passage Retrieval for Open-Domain Question Answering. 2020. arXiv: 2004.04906 [cs.CL].<br />
<br />
[5] Lee Xiong et al. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. 2020. arXiv: 2007.00808 [cs.IR].<br />
<br />
[6] Jingtao Zhan et al. RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. 2020. arXiv: 2006.15498 [cs.IR].<br />
<br />
<br />
<br />
Project # 2 Group members:<br />
<br />
Singh, Gursimran<br />
<br />
Sharma, Govind<br />
<br />
Chanana, Abhinav<br />
<br />
Title: Quick Text Description using Headline Generation and Text To Image Conversion<br />
<br />
Description: An automatic tool that generates a short description from long textual data is a useful mechanism for sharing information quickly. Most current approaches summarize the text using deep learning models ranging from Transformers to different RNNs. For this project, instead of building a standard text summarizer, we aim to provide two separate utilities for generating a quick description of the text. First, we plan to develop a model that produces a headline for the long textual data, and second, we intend to generate an image describing the text.<br />
<br />
Headline Generation - Headline generation is a specific case of text summarization where the output is generally a few words that convey the overall message of the text. In most cases, text summarization is an unsupervised learning problem. For headline generation, however, the original headlines are available in our training dataset, which makes it a supervised learning task. We plan to experiment with different Recurrent Neural Networks such as LSTMs and GRUs with varied architectures. For model evaluation, we are considering BERTScore, with which we can compare the reference headline against the headline generated by the model. We also aim to explore Attention and Transformer networks for headline generation. We will make use of currently available techniques from the research literature, but will also try to develop our own architecture if the previous methods do not give reliable results on our dataset. Therefore, this task would primarily fit under the category of application of deep learning to a particular domain, but could also include some components of new algorithm design.<br />
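<br />
BERTScore compares candidate and reference tokens by greedy cosine matching of their contextual embeddings. The sketch below shows only that matching step on hand-written two-dimensional vectors (an assumption; real scores use BERT token embeddings, and the name `bertscore_f1` is hypothetical).<br />

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bertscore_f1(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # Precision: each candidate token is matched to its best reference token.
    precision = sum(
        max(cosine(c, r) for r in ref_vecs) for c in cand_vecs
    ) / len(cand_vecs)
    # Recall: each reference token is matched to its best candidate token.
    recall = sum(
        max(cosine(r, c) for c in cand_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy token embeddings: the candidate headline covers one of two reference tokens.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0]]
p, r, f1 = bertscore_f1(candidate, reference)
```

Here the candidate is fully supported by the reference (high precision) but misses half of it (lower recall), which is the kind of trade-off the F1 value summarizes.<br />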
<br />
Text to Image Conversion - Generation or synthesis of images from a short text description is another very interesting application domain in deep learning. One approach to image generation maps image pixels to specific features described by a discriminative feature representation of the text. Recurrent Neural Networks have been used successfully to learn such feature representations of text. This approach is difficult to generalize because recognizing discriminative features for texts in different domains is not easy and requires domain expertise. Different generative methods have been used, including Variational Recurrent Auto-Encoders and their extension, the Deep Recurrent Attention Writer (DRAW). We plan to experiment with Generative Adversarial Networks (GANs). GANs have been applied to domain-specific datasets, but we aim to apply different GAN variants to the Microsoft COCO dataset, which has been used with other architectures. The analysis will focus on how well GANs generalize compared to the alternatives on this dataset.<br />
<br />
Scope - The above models will be trained independently on different datasets. Therefore, for a particular text, only one of the two functionalities will be available.<br />
<br />
<br />
<br />
Project # 3 Group members:<br />
<br />
Sikri, Gaurav<br />
<br />
Bhatia, Jaskirat<br />
<br />
Title: Malware Prediction<br />
<br />
Description: The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can hurt consumers and enterprises in many ways. With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security.<br />
<br />
In this project, we plan to predict how likely a machine is to be infected by malware given its current specifications (82 in total), such as company name, firewall status, and physical RAM.<br />
<br />
<br />
<br />
Project # 4 Group members:<br />
<br />
Maleki, Danial<br />
<br />
Rasoolijaberi, Maral<br />
<br />
Title: Binary Deep Neural Network for the domain of Pathology<br />
<br />
Description: The binary neural network, which greatly reduces storage and computation, is a promising technique for deploying deep models on resource-limited devices. However, binarization inevitably causes severe information loss and, even worse, its discontinuity makes the deep network difficult to optimize. We want to investigate the possibility of using these networks in histopathology, where gigapixel images make such savings particularly valuable.<br />
<br />
<br />
Project # 5 Group members:<br />
<br />
Jain, Abhinav<br />
<br />
Bathla, Gautam<br />
<br />
Title: Zero-shot learning with AREN and HUSE<br />
<br />
Description: The Attention Region Discovery and Adaptive Thresholding modules are taken from “Attentive Region Embedding Network for Zero-shot Learning” (https://openaccess.thecvf.com/content_CVPR_2019/papers/Xie_Attentive_Region_Embedding_Network_for_Zero-Shot_Learning_CVPR_2019_paper.pdf), whereas the idea of projecting image and text embeddings into a shared space is taken from “HUSE: Hierarchical Universal Semantic Embeddings” (https://arxiv.org/pdf/1911.05978.pdf). The motivation is that the attribute embedding can provide complementary information, which the model can learn to represent in a shared space, leading to better zero-shot predictions. Also, the Squeeze and Excitation layer has shown impressive results when applied to the feature extraction part of a model; we therefore plan to re-weight the channels before applying the self-attention module so that the model can attend to the image even better. The paper “Attentive Region Embedding Network for Zero-shot Learning” does not make use of the attributes of the classes present in the dataset, so we want to include these attributes and see whether the model can exploit this additional information.<br />
<br />
[[File:a227jain-proposal.jpeg | 300px | Architecture diagram]]<br />
<br />
<br />
Project # 6 Group members:<br />
<br />
You, Bowen<br />
<br />
Avilez, Jose<br />
<br />
Mahmoud, Mohammad<br />
<br />
Wu, Mohan<br />
<br />
Title: Deep Learning Models in Volatility Forecasting<br />
<br />
Description: Price forecasting has become a very hot topic in the financial industry in recent years. We are however very interested in the volatility of such financial instruments. We propose a new deep learning architecture or model to predict volatility and apply our model to real life datasets of various financial products. We will analyze our results and compare them to more traditional methods.<br />
<br />
<br />
Project # 7 Group members:<br />
<br />
Chen, Meixi<br />
<br />
Shen, Wenyu<br />
<br />
Title: Through the Lens of Probability Theory: A Comparison Study of Bayesian Deep Learning Methods<br />
<br />
Description: Deep neural networks are known as black-box models, but they can be made less mysterious by adopting a Bayesian approach. From a Bayesian perspective, one can assign uncertainty to the weights instead of using single point estimates, which allows for better interpretability of deep learning models. However, Bayesian deep learning (BDL) methods are often intractable due to the increased number of parameters, and they often do not perform as well. In this project, we will study different BDL methods, such as Bayesian CNNs using variational inference and the Laplace approximation, with applications to image classification, and we will try to propose improvements where possible.<br />
<br />
<br />
Project # 8 Group members:<br />
<br />
Avilez, Jose<br />
<br />
Title: A functional universal approximation theorem<br />
<br />
Description: In the seminal paper "Approximation by superpositions of a sigmoidal function", Cybenko gave a simple proof using elementary functional analysis that a certain class of functions, called discriminatory functions, serve as valid activation functions for universal neural approximators. The objective of our project is three-fold:<br />
<br />
1) Prove a converse of Cybenko's Universal Approximation Theorem by means of the Stone-Weierstrass theorem<br />
<br />
2) Provide examples and non-examples of Cybenko's discriminatory functions<br />
<br />
3) Construct a neural network for functional data (i.e. data arising in function spaces) and prove a universal approximation theorem for Lp spaces.<br />
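<br />
For reference, the two notions from [1] that these objectives build on can be stated as follows, with $I_n = [0,1]^n$ and $M(I_n)$ the space of finite signed regular Borel measures on $I_n$ (notation as in Cybenko's paper).<br />

```latex
% Discriminatory function
\sigma \text{ is discriminatory if, for } \mu \in M(I_n),\quad
\int_{I_n} \sigma(y^{\mathsf T} x + \theta)\, d\mu(x) = 0
\ \ \forall\, y \in \mathbb{R}^n,\ \theta \in \mathbb{R}
\ \Longrightarrow\ \mu = 0 .

% Universal approximation: for a continuous discriminatory sigma, finite sums
G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma(y_j^{\mathsf T} x + \theta_j)
% are dense in C(I_n) with respect to the supremum norm.
```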
<br />
References:<br />
<br />
[1] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4), 303-314.<br />
<br />
[2] Folland, Gerald B. Real analysis: modern techniques and their applications. Vol. 40. John Wiley & Sons, 1999.<br />
<br />
[3] Ramsay, J. O. (2004). Functional data analysis. Encyclopedia of Statistical Sciences, 4.<br />
<br />
<br />
<br />
Project # 9 Group members:<br />
<br />
Sikaroudi, Milad<br />
<br />
Ashrafi Fashi, Parsa<br />
<br />
Title: '''Magnification Generalization with Model-Agnostic Semantic Features in Histopathology Images'''<br />
<br />
Many embedding methods learn the subspace for only a specific magnification. However, one of the main challenges in histopathology image embedding is handling the different magnification levels when indexing a Whole Slide Image (WSI) [1]. It is well known that significantly different patterns may exist at different magnification levels of a WSI [2]. <br />
It is useful to train an embedding space that discriminates histopathology patches regardless of their magnification, as this would lead to more compact WSI representations. This has been an arduous task because of the significant domain shifts between magnification levels, which show noticeably different patterns. The performance of conventional deep neural networks tends to degrade in the presence of a domain shift, such as when data is gathered from different centers. In this study, we treat different magnification levels as a domain shift for the first time, to see whether we can generalize to the features shared across magnification levels by means of a domain generalization technique known as Model Agnostic Learning of Semantic Features. The hypothesis is that the retrieval statistics of the model trained using episodic domain generalization will not degrade as much as those of the baseline when there is a domain shift. <br />
<br />
[1] Sellaro, Tiffany L., et al. "Relationship between magnification and resolution in digital pathology systems." Journal of pathology informatics 4 (2013).<br />
<br />
[2] Zaveri, Manit, et al. "Recognizing Magnification Levels in Microscopic Snapshots." arXiv preprint arXiv:2005.03748 (2020).<br />
<br />
<br />
Project # 10 Group members:<br />
<br />
Torabian, Parsa<br />
<br />
Ebrahimi Farsangi, Sina<br />
<br />
Moayyedi, Arash<br />
<br />
Title: Meta-Learning Regularizers for Few-Shot Classification Models<br />
<br />
Our project aims at exploring the effects of self-supervised pre-training on few-shot classification. We draw inspiration from the paper “When Does Self-supervision Improve Few-shot Learning?”[1] where the authors analyse the effects of using the Jigsaw puzzle[2] and rotation tasks as regularizers for training Prototypical Networks[3] and Model-Agnostic Meta-Learning (MAML)[4] networks. <br />
<br />
The paper analyzes the effects of regularizing meta-learning models with a self-supervised loss based on rotation and Jigsaw tasks. It is conventionally thought that one reason MAML and other optimization-based meta-learning algorithms work well is that they initialize a network into a task-generalizable state[5]. In this project, we will look at the effects of self-supervised pre-training, as presumably it will initialize the network into a better state than random and potentially improve subsequent meta-learning. We will compare the effects of using self-supervised methods as pre-training, as regularization, and as a combination of both. The effects of other self-supervised learning tasks, such as discoloration and flipping, will be studied as well. We will also look at which combinations of tasks, whether interlaced or applied sequentially, work better and complement one another. We will evaluate our final results on the Omniglot and Mini-ImageNet datasets. These improvements will later be compared with their application to other few-shot learning methods, including first-order MAML and Matching Networks.<br />
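<br />
The rotation task used as a self-supervised signal can be sketched as below: each image is rotated by 0, 90, 180, and 270 degrees and the network is asked to predict which rotation was applied. The combined objective is written here as a weighted sum, with the weight `lam` being an assumed hyperparameter rather than a value taken from [1].<br />

```python
def rotate90(image):
    """Rotate a 2-D list (rows of pixels) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotation_task(image):
    """Return (rotated_image, label) pairs for labels 0..3 (label * 90 degrees)."""
    samples = []
    current = [list(row) for row in image]
    for label in range(4):
        samples.append((current, label))
        current = rotate90(current)
    return samples


def combined_loss(few_shot_loss, self_supervised_loss, lam=1.0):
    """Few-shot loss regularized by the self-supervised loss."""
    return few_shot_loss + lam * self_supervised_loss


# A tiny 2x2 "image" stands in for a real input.
image = [[1, 2],
         [3, 4]]
samples = rotation_task(image)
```

Using the same `rotation_task` output for a separate pre-training phase, instead of adding it to the few-shot loss, corresponds to the pre-training variant discussed above.<br />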
<br />
References:<br />
<br />
[1] https://arxiv.org/abs/1910.03560<br />
<br />
[2] https://arxiv.org/abs/1603.09246<br />
<br />
[3] https://arxiv.org/abs/1703.05175 <br />
<br />
[4] https://arxiv.org/abs/1703.03400<br />
<br />
[5] https://arxiv.org/abs/2003.11539<br />
<br />
<br />
Project # 11 Group Members:<br />
<br />
Shikhar Sakhuja: s2sakhuj@uwaterloo.ca <br />
<br />
Introduction:<br />
<br />
Controller Area Network (CAN bus) is a vehicle bus standard that allows Electronic Control Units (ECUs) within an automobile to communicate with each other without the need for a host computer. Modern automobiles might have up to 70 ECUs for various subsystems such as engine, transmission, braking, etc. The ECUs exchange messages on the CAN bus and enable many modern vehicle capabilities such as automatic start/stop, electric park brakes, lane detection, collision avoidance, and more. Each message exchanged on the bus carries a 29-bit identifier. These 29 bits consist of a combination of the Parameter Group Number (PGN), message priority, and the source address of the message. A parameter group can be, for example, engine temperature, which could include coolant temperature, fuel temperature, etc. The PGN itself includes fields such as the reserved bit, data page, and PDU format. Lastly, the source address maps the message to the ECU it originates from. <br />
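<br />
To make the identifier layout concrete, the sketch below splits a 29-bit identifier into priority, PGN, and source address. It assumes the SAE J1939 field layout (3-bit priority, 1 reserved bit, 1 data-page bit, 8-bit PDU format, 8-bit PDU specific, 8-bit source address), which the text does not name explicitly; the example identifiers are chosen for illustration and are not taken from the truck data.<br />

```python
def decode_can_id(can_id):
    """Split a 29-bit identifier into J1939-style fields (assumed layout)."""
    priority = (can_id >> 26) & 0x7
    reserved = (can_id >> 25) & 0x1
    data_page = (can_id >> 24) & 0x1
    pdu_format = (can_id >> 16) & 0xFF
    pdu_specific = (can_id >> 8) & 0xFF
    source = can_id & 0xFF

    if pdu_format < 240:
        # PDU1: PDU specific is a destination address, not part of the PGN.
        pgn = (reserved << 17) | (data_page << 16) | (pdu_format << 8)
        destination = pdu_specific
    else:
        # PDU2: broadcast message, PDU specific extends the PGN.
        pgn = (reserved << 17) | (data_page << 16) | (pdu_format << 8) | pdu_specific
        destination = None

    return {
        "priority": priority,
        "pgn": pgn,
        "source": source,
        "destination": destination,
    }


broadcast = decode_can_id(0x18FEEE00)   # PDU format 0xFE, broadcast
addressed = decode_can_id(0x18EAFFF9)   # PDU format 0xEA, addressed
```

The decoded PGN and source address are the two quantities the first goal below sets out to predict.<br />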
<br />
Goals:<br />
<br />
This project aims to use messages exchanged on the CAN bus of a Challenger Truck, collected by the Embedded Systems Group at the University of Waterloo. The data is temporal, with a new message exchanged periodically. The goals of this project are twofold:<br />
<br />
(1) Predicting the PGN and source address of message N exchanged on the bus, given messages 1 to N-1. We might also explore predicting attributes within the PGN. <br />
<br />
(2) Predicting the delay between messages N-1 and N, given the delay between each pair of consecutive messages leading up to message N-1. <br />
<br />
Potential Approach:<br />
<br />
For the first goal, we intend to experiment with RNN models along with Attention modules since they have shown promising results in text generation/prediction. <br />
<br />
The second goal is more of an investigative problem where we intend to use regression techniques powered by Neural Networks to predict delays between messages N-1 and N.<br />
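<br />
As a sketch of how the delay-prediction goal could be framed as regression, the snippet below turns message timestamps into inter-message delays and then into fixed-length windows of past delays, each paired with the delay that follows. The window length and the timestamps are assumed values; the snippet only prepares inputs and targets and does not include a model.<br />

```python
def inter_message_delays(timestamps):
    """Delays between consecutive message timestamps."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def sliding_windows(delays, window=3):
    """Pair each run of `window` past delays with the delay that follows it."""
    inputs, targets = [], []
    for start in range(len(delays) - window):
        inputs.append(delays[start:start + window])
        targets.append(delays[start + window])
    return inputs, targets


# Hypothetical arrival times in milliseconds.
timestamps_ms = [0, 10, 20, 35, 45, 60]
delays = inter_message_delays(timestamps_ms)
X, y = sliding_windows(delays, window=3)
```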
<br />
<br />
<br />
<br />
<br />
Project # 12 Group members:<br />
<br />
Hemati, Sobhan <br />
<br />
Meaney, Cameron <br />
<br />
Title: Representation learning of gigapixel histopathology images using PointNet a permutation invariant neural network<br />
<br />
Description:<br />
<br />
In recent years, there has been significant growth in the amount of information available in digital pathology archives. This data is valuable because of its potential uses in research, education, and pathologic diagnosis. As a result, representation learning of histopathology whole slide images (WSIs) has attracted significant attention and become an active area of research. Unfortunately, scientific progress with these data has been difficult because of challenges inherent to the data itself. These challenges include highly complex textures of different tissue types, color variations caused by different stainings, and, most notably, the size of the images, which are often larger than 50,000x50,000 pixels. Additionally, these images are multi-resolution, meaning that each WSI may contain images from different zoom levels, primarily 5X, 10X, 20X, and 40X. With the advent of deep learning, there is optimism that these challenges can be overcome. The main challenge in this approach is that the sheer size of the images makes it infeasible (or impossible) to obtain a vector representation for a WSI, which is a necessary step in order to leverage deep learning algorithms. In practice, this is often bypassed by considering smaller ‘patches’ of the WSI, a set of which is meant to represent the full WSI. This approach leads to a set representation for a WSI. However, unlike traditional image or sequence models, deep networks that process and learn permutation invariant representations from sets are still a developing area of research. Recent attempts at this include Multi-instance Learning Schemes, Deep Sets, and Set Transformers. A particularly successful deep neural network for set representation is PointNet, which was developed for classification and segmentation of 3D objects and point clouds. 
In PointNet, each set is represented using a set of (x,y,z) coordinates, and the network is designed to learn a permutation invariant global representation for each set and then use this representation for classification or segmentation.<br />
<br />
In this project, we first attempt to extend the PointNet network to a convolutional PointNet network that uses a set of image patches rather than (x,y,z) coordinates to learn the universal permutation invariant representation. Then, we attempt to improve the representational power of PointNet as a permutation invariant neural network. For the first part, the main challenge is that while PointNet has been designed for processing sets of the same size, in WSIs the size of the image, and therefore the number of patches, is not fixed. For this reason, we will need to develop an idea that enables CNN-PointNet to process sets of different sizes. One possible solution is to use fake members to standardize the set size and then remove the effect of these fake members in backpropagation using a masking scheme. For the second part, the PointNet network can be improved in many ways. For example, the rotation matrix used is not a real rotation matrix, as orthogonality is only encouraged through a regularization term. However, using a projected gradient technique and the closed-form solution for the nearest orthogonal matrix to a given matrix (Orthogonal Procrustes Problem), we can keep the exact orthogonality constraint and obtain a real rotation matrix. This exact orthogonality is geometrically important because, otherwise, the transformation will likely corrupt the neighborhood structure of the points in each set. Furthermore, PointNet uses a very simple symmetric function (max pooling) as a set approximator; however, more powerful symmetric functions, such as statistical moments, power-sum with a trainable parameter, and other set approximators, can be used. It would be interesting to see how more complicated symmetric functions can improve the representational power of PointNet to achieve more discriminative permutation invariant representations for each set (in this case WSIs).<br />
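<br />
Two of the ideas above can be sketched with NumPy: masking padded ("fake") set members so they cannot win the max pooling, and projecting a matrix onto the nearest orthogonal matrix via the singular value decomposition, which is the closed-form solution of the Orthogonal Procrustes Problem. The feature sizes and function names are illustrative assumptions, and the sketch covers only the forward computation, not the CNN-PointNet itself.<br />

```python
import numpy as np


def masked_max_pool(features, mask):
    """Max-pool over set members, ignoring padded rows.

    features: (set_size, dim) array; mask: (set_size,) booleans, True = real member.
    """
    # Padded rows are set to -inf so they never contribute to the maximum.
    masked = np.where(mask[:, None], features, -np.inf)
    return masked.max(axis=0)


def nearest_orthogonal(matrix):
    """Closest orthogonal matrix in Frobenius norm: U @ Vt from the SVD."""
    u, _, vt = np.linalg.svd(matrix)
    return u @ vt


# A set with two real patches and one padded (fake) member.
features = np.array([[1.0, 5.0],
                     [3.0, 2.0],
                     [9.0, 9.0]])   # padding row, must be ignored
mask = np.array([True, True, False])
pooled = masked_max_pool(features, mask)

# A slightly non-orthogonal transform, as a regularizer alone might produce.
approx = np.array([[0.9, -0.2],
                   [0.1, 1.1]])
q = nearest_orthogonal(approx)
```

Because the maximum never selects a padded row, no gradient reaches it through the pooling step. Note that `U @ Vt` can have determinant -1 (a reflection); obtaining a proper rotation would additionally require flipping the sign of the last column of `U` in that case.<br />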
<br />
Project # 13 Group Members:<br />
<br />
Syed Saad Naseem ssnaseem@uwaterloo.ca<br />
<br />
Title: Text classification of topics related to COVID-19 on social media using deep learning<br />
The COVID-19 pandemic has become a public health emergency and a critical socioeconomic issue worldwide. It is changing the way we live and do business. Social media is a rich source of data about public opinion on many topics, including COVID-19. I plan on using Reddit to get a dataset of posts and comments from users related to COVID-19. Since Reddit is divided into communities, the posts and comments are already clustered by the topic of the community; for example, posts from the politics subreddit will be about politics.<br />
<br />
I plan to make a classifier that takes a given text and tells what the text is talking about, for example politics, studies, relationships, etc. The goals of this project are to:<br />
<br />
• Scrape a dataset from Reddit from different communities<br />
<br />
• Train a deep learning model (CNN or RNN model) to classify a given text into the possible categories<br />
<br />
• Test the model on social media posts about COVID-19<br />
<br />
<br />
<br />
Project # 14 Group members<br />
<br />
Edwards, John<br />
<br />
Title: Click-through Rate Prediction Using Historical User Data<br />
<br />
Click-through Rate (CTR) prediction consists of forecasting a user's probability of clicking on a specified target. CTR is used largely by online advertising systems, which sell ad space on a cost-per-click pricing model, to assess the likelihood of a user clicking on a targeted ad. <br />
<br />
User session logs provide firms with an assortment of individual-specific features, a large number of which are categorical. Additionally, advertisers possess multiple ad candidates, each with its own features. The challenge of CTR prediction is to design a model that captures the interaction effects of these features to produce high-quality forecasts and pair users with advertisements that have high potential for click conversion. Additionally, computational efficiency must be balanced with model complexity so that predictions can be made online throughout a user's session.<br />
<br />
This project's primary objective will be to attempt to create a new Deep Neural Network (DNN) architecture for producing high-quality CTR forecasts while also addressing the aforementioned challenges.<br />
<br />
While many variants of DNNs for CTR prediction exist, they can differ greatly in application setting. Specifically, the vast majority of models evaluate each user-ad interaction independently and fail to utilise the information contained in each user's historical ad impressions. Only a small subset of models [1,2,4] have tried to address this by adapting architectures to use historical information. This project's focus will be within this application setting, exploring new architectures that can better utilise the information contained in a user's historical behaviour. <br />
<br />
This project's implementation will consist of the following action plan:<br />
<br />
• Develop a new model architecture inspired by innovations of previous CTR network designs that lacked the ability to utilise a user's historical data [4,5].<br />
<br />
• Use the public benchmark Avito advertising dataset to empirically evaluate the new model's performance and compare it against previous state-of-the-art models for this dataset. <br />
<br />
References:<br />
<br />
[1] Ouyang, Wentao & Zhang, Xiuwu & Ren, Shukui & Li, Li & Liu, Zhaojie & Du, Yanlong. (2019). Click-Through Rate Prediction with the User Memory Network. <br />
<br />
[2] Ouyang, Wentao & Zhang, Xiuwu & Li, Li & Zou, Heng & Xing, Xin & Liu, Zhaojie & Du, Yanlong. (2019). Deep Spatio-Temporal Neural Networks for Click-Through Rate Prediction. 2078-2086. 10.1145/3292500.3330655. <br />
<br />
[3] Ouyang, Wentao & Zhang, Xiuwu & Ren, Shukui & Qi, Chao & Liu, Zhaojie & Du, Yanlong. (2019). Representation Learning-Assisted Click-Through Rate Prediction. 4561-4567. 10.24963/ijcai.2019/634. <br />
<br />
[4] Li, Zeyu, Wei Cheng, Yang Chen, H. Chen and W. Wang. “Interpretable Click-Through Rate Prediction through Hierarchical Attention.” Proceedings of the 13th International Conference on Web Search and Data Mining (2020).<br />
<br />
[5] Zhou, Guorui & Gai, Kun & Zhu, Xiaoqiang & Song, Chenru & Fan, Ying & Zhu, Han & Ma, Xiao & Yan, Yanghui & Jin, Junqi & Li, Han. (2018). Deep Interest Network for Click-Through Rate Prediction. 1059-1068. 10.1145/3219819.3219823.</div>
F21-STAT 940-Proposal (revision of 2020-12-11T15:21:57Z)<p>A227jain: Updated proposal</p>
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
<br />
<br />
<br />
Project # 1 Group members:<br />
<br />
McWhannel, Pierre<br />
<br />
Yan, Nicole<br />
<br />
Hussein Salamah, Ahmed <br />
<br />
Title: Dense Retrieval for Conversational Information Seeking <br />
<br />
Description:<br />
One of the recognized problems in Information Retrieval (IR) is conversational search, which has attracted much attention in the form of conversational assistants such as Alexa, Siri, and Cortana. Meeting users' information needs is the ultimate goal of conversational search systems; in this setting, questions are asked sequentially, imposing the multi-turn format of the Conversational Information Seeking (CIS) task. The TREC Conversational Assistance Track (CAsT) [3] is a multi-turn conversational search task that provides a large-scale reusable test collection for sequences of conversational queries. The response of the conversational model is not a list of relevant documents; it is limited to brief response passages of 1 to 3 sentences.<br />
<br />
[[File:Screen Shot 2020-10-09 at 1.33.00 PM.png | 300px | Example Queries in CAsT]]<br />
<br />
In [4], the authors focus on improving open-domain question answering by using dense representations for retrieval instead of the traditional methods. They adopt a simple dual-encoder framework to construct a learnable retriever over large collections. We want to adopt this dense representation for the conversational model in the CAsT task and compare its performance with that of other approaches in the literature. Performance will be measured using graded relevance on a five-point scale: Fails to meet, Slightly meets, Moderately meets, Highly meets, and Fully meets.<br />
<br />
We aim to further improve our system performance by integrating the following techniques:<br />
<br />
• Paragraph-level pre-training tasks: ICT, BFS, and WLP [1]<br />
<br />
• ANCE training: periodically using checkpoints to encode documents, from which the strong negatives close to the relevant document would be used as next training negatives [5]<br />
<br />
In summary, this project is exploratory in nature: we will apply state-of-the-art Dense Passage Retrieval (DPR) techniques based on BERT [4, 6] to a question answering (QA) problem. Current first-stage retrieval approaches mainly rely on bag-of-words models. In this project, we hope to explore the feasibility of using state-of-the-art methods such as BERT instead. We will first compare how these perform on the TREC CAsT datasets [3] against the results retrieved using BM25. After these first points of comparison, we will explore one or more techniques designed to improve the performance of DPR [1, 5].<br />
<br />
References<br />
<br />
[1] Wei-Cheng Chang et al. Pre-training Tasks for Embedding-based Large-scale Retrieval. 2020. arXiv: 2002.03932 [cs.LG].<br />
<br />
[2] Zhuyun Dai and Jamie Callan. Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval. 2019. arXiv: 1910.10687 [cs.IR].<br />
<br />
[3] Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. TREC CAsT 2019: The Conversational Assistance Track Overview. 2020. arXiv: 2003.13624 [cs.IR].<br />
<br />
[4] Vladimir Karpukhin et al. Dense Passage Retrieval for Open-Domain Question Answering. 2020. arXiv: 2004.04906 [cs.CL].<br />
<br />
[5] Lee Xiong et al. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. 2020. arXiv: 2007.00808 [cs.IR].<br />
<br />
[6] Jingtao Zhan et al. RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. 2020. arXiv: 2006.15498 [cs.IR].<br />
<br />
<br />
<br />
Project # 2 Group members:<br />
<br />
Singh, Gursimran<br />
<br />
Sharma, Govind<br />
<br />
Chanana, Abhinav<br />
<br />
Title: Quick Text Description using Headline Generation and Text To Image Conversion<br />
<br />
Description: An automatic tool that generates a short description from long textual data is a useful mechanism for sharing information quickly. Most current approaches summarize the text using deep learning models ranging from Transformers to various RNNs. For this project, instead of building a standard text summarizer, we aim to provide two separate utilities for generating a quick description of the text. First, we plan to develop a model that produces a headline for the long textual data, and second, we intend to generate an image describing the text. <br />
<br />
Headline Generation - Headline generation is a specific case of text summarization where the output is generally a few words that convey the overall gist of the text. In most cases, text summarization is an unsupervised learning problem. For headline generation, however, the original headlines are available in our training dataset, which makes it a supervised learning task. We plan to experiment with different Recurrent Neural Networks like LSTMs and GRUs with varied architectures. For model evaluation, we are considering BERTScore, which lets us compare the reference headline with the headline generated by the model. We also aim to explore Attention and Transformer Networks for the text (headline) generation. We will make use of the techniques available in the research literature, but will also try to develop our own architecture if the previous methods do not give reliable results on our dataset. Therefore, this task would primarily fit under the category of application of deep learning to a particular domain, but could also include some components of new algorithm design.<br />
<br />
Text to Image Conversion - Generation or synthesis of images from a short text description is another very interesting application domain in deep learning. One approach for image generation is based on mapping image pixels to specific features as described by the discriminative feature representation of the text. Recurrent Neural Networks have been successfully used in learning such feature representations of text. This approach is difficult to generalize because the recognition of discriminative features for texts in different domains is not an easy task and it requires domain expertise. Different generative methods have been used including Variational Recurrent Auto-Encoders and its extension in Deep Recurrent Attention Writer (DRAW). We plan to experiment with Generative Adversarial Networks (GAN). Application of GANs on domain-specific datasets has been done but we aim to apply different variants of GANs on the Microsoft COCO dataset which has been used in other architectures. The analysis will be focusing on how well GANs are able to generalize when compared to other alternatives on the given dataset.<br />
<br />
Scope - The above models will be trained independently on different datasets. Therefore, for a particular text, only one of the two functionalities will be available.<br />
<br />
<br />
<br />
Project # 3 Group members:<br />
<br />
Sikri, Gaurav<br />
<br />
Bhatia, Jaskirat<br />
<br />
Title: Malware Prediction<br />
<br />
Description: The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can hurt consumers and enterprises in many ways. With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security.<br />
<br />
In this project, we plan to predict how likely a machine is to be infected by malware given its current specifications (82 in total), such as company name, firewall status, and physical RAM.<br />
<br />
<br />
<br />
Project # 4 Group members:<br />
<br />
Maleki, Danial<br />
<br />
Rasoolijaberi, Maral<br />
<br />
Title: Binary Deep Neural Network for the domain of Pathology<br />
<br />
Description: Binary neural networks greatly reduce storage and computation, making them a promising technique for deploying deep models on resource-limited devices. However, binarization inevitably causes severe information loss and, even worse, its discontinuity makes the deep network difficult to optimize. We want to investigate the possibility of using these networks in histopathology, where gigapixel images make such savings especially valuable.<br />
<br />
<br />
Project # 5 Group members:<br />
<br />
Jain, Abhinav<br />
<br />
Bathla, Gautam<br />
<br />
Title: Zero-shot learning with AREN and HUSE<br />
<br />
Description: The Attention Region Discovery and Adaptive Thresholding modules are taken from “Attentive Region Embedding Network for Zero-shot Learning” (https://openaccess.thecvf.com/content_CVPR_2019/papers/Xie_Attentive_Region_Embedding_Network_for_Zero-Shot_Learning_CVPR_2019_paper.pdf), whereas the idea of projecting image and text embeddings into a shared space is taken from “HUSE: Hierarchical Universal Semantic Embeddings” (https://arxiv.org/pdf/1911.05978.pdf). The motivation is that the attribute embedding can provide complementary information, which the model can learn to represent in a shared space, leading to better zero-shot predictions. Also, the Squeeze and Excitation layer showed impressive results when applied to the feature extraction part of the model, so we re-weight the channels first, before applying the self-attention module, so that the model can attend to the image even better. The paper “Attentive Region Embedding Network for Zero-shot Learning” does not make use of the attributes of the classes present in the dataset, so we want to include these attributes and see whether the model can benefit from this additional information.<br />
<br />
[[File:a227jain-proposal.png | 300px | Architecture diagram]]<br />
<br />
<br />
Project # 6 Group members:<br />
<br />
You, Bowen<br />
<br />
Avilez, Jose<br />
<br />
Mahmoud, Mohammad<br />
<br />
Wu, Mohan<br />
<br />
Title: Deep Learning Models in Volatility Forecasting<br />
<br />
Description: Price forecasting has become a very hot topic in the financial industry in recent years. We are however very interested in the volatility of such financial instruments. We propose a new deep learning architecture or model to predict volatility and apply our model to real life datasets of various financial products. We will analyze our results and compare them to more traditional methods.<br />
<br />
<br />
Project # 7 Group members:<br />
<br />
Chen, Meixi<br />
<br />
Shen, Wenyu<br />
<br />
Title: Through the Lens of Probability Theory: A Comparison Study of Bayesian Deep Learning Methods<br />
<br />
Description: Deep neural networks have been known as black box models, but they can be made less mysterious by adopting a Bayesian approach. From a Bayesian perspective, one is able to assign uncertainty to the weights instead of having single point estimates, which allows for better interpretability of deep learning models. However, Bayesian deep learning (BDL) methods are often intractable due to the increased number of parameters, and they often do not perform as well. In this project, we will study different BDL methods such as Bayesian CNNs using variational inference and Laplace approximation, with applications to image classification, and we will try to propose improvements where possible.<br />
<br />
<br />
Project # 8 Group members:<br />
<br />
Avilez, Jose<br />
<br />
Title: A functional universal approximation theorem<br />
<br />
Description: In the seminal paper "Approximation by superpositions of a sigmoidal function", Cybenko gave a simple proof using elementary functional analysis that a certain class of functions, called discriminatory functions, serve as valid activation functions for universal neural approximators. The objective of our project is three-fold:<br />
<br />
1) Prove a converse of Cybenko's Universal Approximation Theorem by means of the Stone-Weierstrass theorem<br />
<br />
2) Provide examples and non-examples of Cybenko's discriminatory functions<br />
<br />
3) Construct a neural network for functional data (i.e. data arising in function spaces) and prove a universal approximation theorem for Lp spaces.<br />
<br />
References:<br />
<br />
[1] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4), 303-314.<br />
<br />
[2] Folland, Gerald B. Real analysis: modern techniques and their applications. Vol. 40. John Wiley & Sons, 1999.<br />
<br />
[3] Ramsay, J. O. (2004). Functional data analysis. Encyclopedia of Statistical Sciences, 4.<br />
<br />
<br />
<br />
Project # 9 Group members:<br />
<br />
Sikaroudi, Milad<br />
<br />
Ashrafi Fashi, Parsa<br />
<br />
Title: '''Magnification Generalization with Model-Agnostic Semantic Features in Histopathology Images'''<br />
<br />
Many embedding methods learn the subspace for only a specific magnification. However, one of the main challenges in histopathology image embedding is handling the different magnification levels when indexing a Whole Slide Image (WSI) [1]. It is well known that significantly different patterns may exist at different magnification levels of a WSI [2]. <br />
It is useful to train an embedding space for discriminating the histopathology patches regardless of their magnifications. That would lead to learning more compact WSI representations. It has been an arduous task because of the significant domain shifts between different magnification levels with noticeably different patterns. The performance of conventional deep neural networks tends to degrade in the presence of a domain shift, such as the gathering of data from different centers. In this study for the first time, we are going to introduce different magnification levels as a domain shift to see if we can generalize to in-common features in different magnification levels by means of a domain generalization technique, known as Model Agnostic Learning of Semantic Features. The hypothesis is that the statistics of retrieval for the model trained using episodic domain generalization will not degrade as much as the baseline when there is a domain shift. <br />
<br />
[1] Sellaro, Tiffany L., et al. "Relationship between magnification and resolution in digital pathology systems." Journal of pathology informatics 4 (2013).<br />
<br />
[2] Zaveri, Manit, et al. "Recognizing Magnification Levels in Microscopic Snapshots." arXiv preprint arXiv:2005.03748 (2020).<br />
<br />
<br />
Project # 10 Group members:<br />
<br />
Torabian, Parsa<br />
<br />
Ebrahimi Farsangi, Sina<br />
<br />
Moayyedi, Arash<br />
<br />
Title: Meta-Learning Regularizers for Few-Shot Classification Models<br />
<br />
Our project aims at exploring the effects of self-supervised pre-training on few-shot classification. We draw inspiration from the paper “When Does Self-supervision Improve Few-shot Learning?”[1] where the authors analyse the effects of using the Jigsaw puzzle[2] and rotation tasks as regularizers for training Prototypical Networks[3] and Model-Agnostic Meta-Learning (MAML)[4] networks. <br />
<br />
The introduced paper analyzes the effects of regularizing meta-learning models using self-supervised loss, based on rotation and Jigsaw tasks. It is conventionally thought that one of the reasons MAML and other optimization based meta-learning algorithms work well is due to initializing a network into a task-generalizable state[5]. In this project, we will be looking at the effects of self-supervised pre-training, as presumably it will initialize the network into a better state than random, and potentially improve subsequent meta-learning. We will compare the effects of using self-supervised methods as pre-training, as regularization, and the combination of both. The effects of other self-supervised learning tasks, such as discoloration and flipping, will be studied as well. We will also look at which combination of tasks, whether interlaced or applied sequentially, work better and complement one another. We will evaluate our final results on the Omniglot and Mini-Imagenet datasets. These improvements will later be compared with their application on other few-shot learning methods, including first-order MAML and Matching Networks.<br />
<br />
References:<br />
<br />
[1] https://arxiv.org/abs/1910.03560<br />
<br />
[2] https://arxiv.org/abs/1603.09246<br />
<br />
[3] https://arxiv.org/abs/1703.05175 <br />
<br />
[4] https://arxiv.org/abs/1703.03400<br />
<br />
[5] https://arxiv.org/abs/2003.11539<br />
<br />
<br />
Project # 11 Group Members:<br />
<br />
Shikhar Sakhuja: s2sakhuj@uwaterloo.ca <br />
<br />
Introduction:<br />
<br />
Controller Area Network (CAN bus) is a vehicle bus standard that allows Electronic Control Units (ECUs) within an automobile to communicate with each other without the need for a host computer. Modern automobiles might have up to 70 ECUs for various subsystems such as Engine, Transmission, Braking, etc. The ECUs exchange messages on the CAN bus and enable many modern vehicle capabilities such as automatic start/stop, electric park brakes, lane detection, collision avoidance, and more. Each message exchanged on the bus is encoded as a 29-bit packet. These 29 bits consist of a combination of Parameter Group Number (PGN), message priority, and the source address of the message. Parameter groups can be, for example, engine temperature, which could include coolant temperature, fuel temperature, etc. The PGN itself includes information such as priority, reserved status, data page, and PDU format. Lastly, the source address maps the message to the ECU it originates from. <br />
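<br />
To make the 29-bit layout concrete, here is a minimal sketch that splits an identifier into priority, PGN, and source address. It assumes the SAE J1939 field layout (3-bit priority, 18-bit PGN made of reserved bit, data page, PDU format and PDU specific, 8-bit source address), which the proposal does not name explicitly; the function and type names are illustrative only.<br />

```python
from typing import NamedTuple


class CanId(NamedTuple):
    priority: int        # 3 bits
    pgn: int             # 18 bits
    source_address: int  # 8 bits


def decode_can_id(can_id: int) -> CanId:
    """Split a 29-bit identifier, assuming the SAE J1939 layout."""
    if not 0 <= can_id < (1 << 29):
        raise ValueError("expected a 29-bit identifier")
    priority = (can_id >> 26) & 0x7
    pdu_format = (can_id >> 16) & 0xFF
    # 18 bits: reserved, data page, PDU format, PDU specific.
    pgn = (can_id >> 8) & 0x3FFFF
    if pdu_format < 240:
        # Peer-to-peer message: the low byte is a destination address,
        # so it is not counted as part of the PGN.
        pgn &= 0x3FF00
    source_address = can_id & 0xFF
    return CanId(priority, pgn, source_address)


example = decode_can_id(0x0CF00400)
```

Under this assumed layout, the identifier 0x0CF00400 decodes to priority 3, PGN 61444, and source address 0.<br />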
<br />
Goals:<br />
<br />
This project uses messages exchanged on the CAN bus of a Challenger Truck, collected by the Embedded Systems Group at the University of Waterloo. The data is temporal, with a new message exchanged periodically. The goals of this project are twofold:<br />
<br />
(1) Predicting the PGN and source address of message N exchanged on the bus, given messages 1 to N-1. We might also explore predicting attributes within the PGN.<br />
<br />
(2) Predicting the delay between messages N-1 and N, given the delay between each pair of consecutive messages leading up to message N-1.<br />
<br />
Potential Approach:<br />
<br />
For the first goal, we intend to experiment with RNN models along with Attention modules since they have shown promising results in text generation/prediction. <br />
<br />
The second goal is more of an investigative problem where we intend to use regression techniques powered by Neural Networks to predict delays between messages N-1 and N.<br />
<br />
<br />
<br />
<br />
<br />
Project # 12 Group members:<br />
<br />
Hemati, Sobhan <br />
<br />
Meaney, Cameron <br />
<br />
Title: Representation learning of gigapixel histopathology images using PointNet, a permutation-invariant neural network<br />
<br />
Description:<br />
<br />
In recent years, there has been significant growth in the amount of information available in digital pathology archives. This data is valuable because of its potential uses in research, education, and pathologic diagnosis. As a result, representation learning of histopathology whole slide images (WSIs) has attracted significant attention and become an active area of research. Unfortunately, scientific progress with this data has been difficult because of challenges inherent to the data itself. These challenges include highly complex textures of different tissue types, color variations caused by different stainings, and most notably, the size of the images, which are often larger than 50,000x50,000 pixels. Additionally, these images are multi-resolution, meaning that each WSI may contain images from different zoom levels, primarily 5X, 10X, 20X, and 40X. With the advent of deep learning, there is optimism that these challenges can be overcome. The main challenge in this approach is that the sheer size of the images makes it infeasible (or impossible) to obtain a vector representation for a WSI, which is a necessary step in order to leverage deep learning algorithms. In practice, this is often bypassed by considering smaller ‘patches’ of the WSI, a set of which is meant to represent the full WSI. This approach leads to a set representation for a WSI. However, unlike traditional image or sequence models, deep networks that process and learn permutation-invariant representations from sets are still a developing area of research. Recent attempts at this include Multi-instance Learning Schemes, Deep Sets, and Set Transformers. A particularly successful deep neural network for set representation is PointNet, which was developed for classification and segmentation of 3D objects and point clouds. 
In PointNet, each set is represented using a set of (x,y,z) coordinates, and the network is designed to learn a permutation invariant global representation for each set and then use this representation for classification or segmentation.<br />
<br />
In this project, we first attempt to extend PointNet to a convolutional PointNet that uses a set of image patches, rather than (x,y,z) coordinates, to learn the universal permutation-invariant representation. Then, we attempt to improve the representational power of PointNet as a permutation-invariant neural network. For the first part, the main challenge is that PointNet was designed to process sets of the same size, whereas in WSIs the size of the image, and therefore the number of patches, is not fixed. For this reason, we will need to develop an idea that enables CNN-PointNet to process sets of different sizes. One possible solution is to use fake members to standardize the set size and then remove the effect of these fake members in backpropagation using a masking scheme. For the second part, the PointNet network can be improved in many ways. For example, the rotation matrix used is not a real rotation matrix, as orthogonality is only encouraged through a regularization term. However, using a projected gradient technique and the closed-form solution for the nearest orthogonal matrix to a given matrix (the Orthogonal Procrustes Problem), we can keep the exact orthogonality constraint and obtain a real rotation matrix. This exact orthogonality is geometrically important because, otherwise, the transformation will likely corrupt the neighborhood structure of the points in each set. Furthermore, PointNet uses a very simple symmetric function (max pooling) as a set approximator; however, more powerful symmetric functions, such as statistical moments, power-sums with a trainable parameter, and other set approximators, can be used. It would be interesting to see how more complicated symmetric functions can improve the representational power of PointNet to achieve more discriminative permutation-invariant representations for each set (in this case WSIs).<br />
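<br />
The masking idea above can be sketched in a few lines. This is only a toy illustration under our own assumptions: patch features are plain lists of floats, fake members are zero vectors, and the helper names `pad_set` and `masked_max_pool` are hypothetical. It shows the pooling step only, not the proposed network.<br />

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def pad_set(
    features: Sequence[Vector], size: int, dim: int
) -> Tuple[List[Vector], List[bool]]:
    """Pad a set with zero-valued fake members up to `size`; return (padded, mask)."""
    if len(features) > size:
        raise ValueError("set is larger than the target size")
    n_fake = size - len(features)
    padded = [list(f) for f in features] + [[0.0] * dim for _ in range(n_fake)]
    mask = [True] * len(features) + [False] * n_fake
    return padded, mask


def masked_max_pool(padded: Sequence[Vector], mask: Sequence[bool]) -> Vector:
    """Feature-wise max over real members only; fake members are ignored."""
    real = [f for f, keep in zip(padded, mask) if keep]
    if not real:
        raise ValueError("set has no real members")
    return [max(column) for column in zip(*real)]


# Two real patch features with negative values, padded to a fixed size of 4.
patches = [[-1.0, -2.0], [-3.0, -0.5]]
padded, mask = pad_set(patches, size=4, dim=2)
pooled = masked_max_pool(padded, mask)
```

Without the mask, the zero-valued fake members would win the max and the pooled vector would be all zeros. Because fake members never reach the max, an automatic differentiation framework would also route no gradient through them.<br />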
<br />
Project # 13 Group Members:<br />
<br />
Syed Saad Naseem ssnaseem@uwaterloo.ca<br />
<br />
Title: Text classification of topics related to COVID-19 on social media using deep learning<br />
The COVID-19 pandemic has become a public health emergency and a critical socioeconomic issue worldwide. It is changing the way we live and do business. Social media is a rich source of data about public opinion on many topics, including COVID-19. I plan to use Reddit to build a dataset of users' posts and comments related to COVID-19. Since Reddit is divided into communities, the posts and comments are already clustered by the topic of the community; for example, posts from the politics subreddit will be about politics.<br />
<br />
I plan to build a classifier that takes a given text and identifies what the text is talking about, for example politics, studies, relationships, etc. The goals of this project are to:<br />
<br />
• Scrape a dataset from Reddit from different communities<br />
<br />
• Train a deep learning model (CNN or RNN model) to classify a given text into the possible categories<br />
<br />
• Test the model on social media posts talking about COVID-19<br />
<br />
<br />
<br />
Project # 14 Group members<br />
<br />
Edwards, John<br />
<br />
Title: Click-through Rate Prediction Using Historical User Data<br />
<br />
Click-through Rate (CTR) prediction consists of forecasting a user's probability of clicking on a specified target. CTR prediction is used largely by online advertising systems, which sell ad space on a cost-per-click pricing model, to assess the likelihood of a user clicking on a targeted ad.<br />
<br />
User session logs provide firms with an assortment of individual-specific features, a large number of which are categorical. Additionally, advertisers possess multiple ad candidates, each with their own respective features. The challenge of CTR prediction is to design a model that captures the interacting effects of these features to produce high-quality forecasts and pair users with advertisements that have high potential for click conversion. Additionally, computational efficiency must be balanced with model complexity so that predictions can be made in an online setting throughout the progression of a user's session.<br />
<br />
This project's primary objective will be to attempt to create a new Deep Neural Network (DNN) architecture for producing high-quality CTR forecasts while also addressing the aforementioned challenges.<br />
<br />
While many variants of DNNs for CTR prediction exist, they can differ greatly in application setting. Specifically, the vast majority of models evaluate each user-ad interaction independently. They fail to utilize the information contained in each specific user's historical ad impressions. Only a small subset of models [1,2,4] have tried to address this by adapting architectures to utilize historical information. This project's focus will be within this application setting, exploring new architectures that can better utilize the information contained in a user's historical behaviour.<br />
<br />
This project's implementation will consist of the following action plan:<br />
<br />
• Develop a new model architecture inspired by innovations of previous CTR network designs that lacked the ability to utilize a user's historical data [4,5].<br />
<br />
• Use the public benchmark Avito advertising dataset to empirically evaluate the new model's performance and compare it against previous state-of-the-art models for this dataset.<br />
<br />
References:<br />
<br />
[1] Ouyang, Wentao & Zhang, Xiuwu & Ren, Shukui & Li, Li & Liu, Zhaojie & Du, Yanlong. (2019). Click-Through Rate Prediction with the User Memory Network. <br />
<br />
[2] Ouyang, Wentao & Zhang, Xiuwu & Li, Li & Zou, Heng & Xing, Xin & Liu, Zhaojie & Du, Yanlong. (2019). Deep Spatio-Temporal Neural Networks for Click-Through Rate Prediction. 2078-2086. 10.1145/3292500.3330655. <br />
<br />
[3] Ouyang, Wentao & Zhang, Xiuwu & Ren, Shukui & Qi, Chao & Liu, Zhaojie & Du, Yanlong. (2019). Representation Learning-Assisted Click-Through Rate Prediction. 4561-4567. 10.24963/ijcai.2019/634. <br />
<br />
[4] Li, Zeyu, Wei Cheng, Yang Chen, H. Chen and W. Wang. “Interpretable Click-Through Rate Prediction through Hierarchical Attention.” Proceedings of the 13th International Conference on Web Search and Data Mining (2020)<br />
<br />
[5] Zhou, Guorui & Gai, Kun & Zhu, Xiaoqiang & Song, Chenru & Fan, Ying & Zhu, Han & Ma, Xiao & Yan, Yanghui & Jin, Junqi & Li, Han. (2018). Deep Interest Network for Click-Through Rate Prediction. 1059-1068. 10.1145/3219819.3219823.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Time-series_Generative_Adversarial_Networks&diff=49715Time-series Generative Adversarial Networks2020-12-07T03:57:48Z<p>A227jain: Grammatical improvements</p>
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)<br />
<br />
== Introduction ==<br />
A time-series model should not only be good at learning the overall distribution of temporal features within different time points, but it should also be good at capturing the dynamic relationship between the temporal variables across time.<br />
<br />
The popular autoregressive approach in time-series or sequence analysis generally focuses on minimizing the error involved in multi-step sampling, thereby improving the temporal dynamics of the generated data <sup>[1]</sup>. In this approach, the distribution of sequences is broken down into a product of conditional probabilities. The deterministic nature of this approach works well for forecasting, but it is not very promising in a generative setup. The GAN approach, when applied directly to time-series, simply tries to learn <math>p(X|t)</math> using a generator and discriminator setup, but, unlike autoregressive models, it fails to leverage the prior probabilities.

This paper proposes a novel GAN architecture that combines the two approaches (unsupervised GANs and supervised autoregressive models), allowing a generative model to preserve temporal dynamics while learning the overall distribution. This mechanism is termed the '''Time-series Generative Adversarial Network''' or '''TimeGAN'''. To incorporate supervised learning of data into the GAN architecture, this approach makes use of an embedding network that provides a reversible mapping between the temporal features and their latent representations. The key insight of this paper is that the embedding network is trained in parallel with the generator/discriminator network.
<br />
This approach leverages the flexibility of GANs along with the control of the autoregressive model resulting in significant improvements in the generation of realistic time-series.<br />
<br />
== Related Work ==<br />
The TimeGAN mechanism combines ideas from different research threads in time-series analysis.<br />
<br />
Due to differences between closed-loop training (conditioned on the ground truth) and open-loop inference (conditioned on the previous guess), there can be significant prediction error in multi-step sampling in autoregressive recurrent networks <sup>[2]</sup>. Different methods have been proposed to remedy this, including Scheduled Sampling <sup>[1]</sup>, based on curriculum learning <sup>[2]</sup>, where the models are trained to generate output conditioned on a mix of ground truth and previous outputs. Another method, inspired by adversarial domain adaptation, trains an auxiliary discriminator to distinguish between free-running and teacher-forced hidden states, encouraging the training and sampling dynamics to converge<sup>[3][4]</sup>. Approaches based on actor-critic methods <sup>[5]</sup> have also been proposed; these introduce a critic conditioned on target outputs that estimates next-token value functions, which nudge the actor’s free-running predictions <sup>[11]</sup>. While all these methods try to improve multi-step sampling, they are still inherently deterministic.

Direct applications of the GAN architecture to time-series data, like C-RNN-GAN or RCGAN <sup>[6]</sup>, try to generate the time-series recurrently, sometimes taking the generated output from the previous step as input (as in the case of RCGAN) along with the noise vector. Recently, adding timestamp information for conditioning has also been proposed in these setups to handle inconsistent sampling. But these approaches remain very GAN-centric and depend only on the traditional adversarial feedback (fake/real) to learn, which is not sufficient to capture the temporal dynamics.
<br />
== Problem Formulation ==<br />
Generally, time-series data can be decomposed into two components: static features (variables that remain constant over the entire time-series, or for a long period of time) and temporal features (variables that change with respect to time). The paper uses <math>S</math> to denote the static component and <math>X</math> to denote the temporal features. In this setting, the inputs to the model can be thought of as tuples <math>(S, X_{1:T})</math> drawn from a joint distribution <math>p</math>. The objective of a generative model is to learn from the training data an approximation <math>\hat{p}(S, X_{1:T})</math> of the original distribution <math>p(S, X_{1:T})</math>. Along with this joint distribution, another objective is to simultaneously learn its autoregressive decomposition <math>p(S, X_{1:T}) = p(S)\prod_t p(X_t|S, X_{1:t-1})</math>. This gives the following two objective functions.
<br />
<div align="center"><math>\min_{\hat{p}} D\left(p(S, X_{1:T}) \,\|\, \hat{p}(S, X_{1:T})\right)</math>, and </div>


<div align="center"><math>\min_{\hat{p}} D\left(p(X_t | S, X_{1:t-1}) \,\|\, \hat{p}(X_t | S, X_{1:t-1})\right)</math></div>
<br />
== Proposed Architecture ==<br />
Apart from the normal GAN components of sequence generator and sequence discriminator, TimeGAN has two additional elements: an embedding function and a recovery function. As mentioned before, all these components are trained concurrently. Figure 1 shows how these four components are arranged and how the information flows between them during training in TimeGAN.<br />
<br />
<div align="center"> [[File:Architecture_TimeGAN.PNG|Architecture of TimeGAN.]] </div><br />
<div align="center">'''Figure 1'''</div><br />
<br />
=== Embedding and Recovery Functions ===<br />
These functions map between the temporal features and their latent representation. This mapping reduces the dimensionality of the original feature space. Let <math>H_s</math> and <math>H_x</math> denote the latent representations of <math>S</math> and <math>X</math> features in the original space. Therefore, the embedding function has the following form:<br />
<br />
<div align="center"> [[File:embedding_formula.PNG]] </div><br />
<br />
And similarly, the recovery function has the following form:<br />
<br />
<div align="center"> [[File:recovery_formula.PNG]] </div><br />
<br />
In the paper, these functions are implemented using a recurrent network for '''e''' and a feedforward network for '''r'''. These are only implementation choices; the functions can be parametrized by any architecture.
<br />
=== Sequence Generator and Discriminator ===<br />
Coming to the conventional GAN components of TimeGAN, there is a sequence generator and a sequence discriminator. But these do not work on the original space, rather the sequence generator uses the random input noise to generate sequences in the latent space. Thus, the generator takes as input the noise vectors <math>Z_s</math>, <math>Z_x</math> and turns them into a latent representation <math>H_s</math> and <math>H_x</math>. This function is implemented using a recurrent network. <br />
<br />
The discriminator takes as input the latent representation from the embedding space and produces its binary classification (synthetic/real). This is implemented using a bidirectional recurrent network with a feedforward output layer.<br />
<br />
=== Architecture Workflow ===<br />
The embedding and recovery functions ought to guarantee an accurate reversible mapping between the feature space and the latent space. After the embedding function turns the original data <math>(s, x_{1:t})</math> into the embedding space i.e. <math>h_s</math>, <math>h_x</math>, the recovery function should be able to reconstruct the original data as accurately as possible from this latent representation. Denoting the reconstructed data by <math>\tilde{s}</math> and <math>\tilde{x}_{1:t}</math>, we get the first objective function of the reconstruction loss:<br />
<br />
<div align="center"> [[File:recovery_loss.PNG]] </div><br />
<br />
The generator in TimeGAN receives not only the noise vector <math>Z</math> as input but also, in autoregressive fashion, its own previous outputs, i.e. <math>h_s</math> and <math>h_{1:t}</math>. The generator uses these inputs to produce the synthetic embeddings. Gradients of the unsupervised loss are then used to decrease the likelihood of correctly classifying the synthetic output at the generator and to increase it at the discriminator. This gives the second objective function, the unsupervised loss:<br />
<br />
<div align="center"> [[File:unsupervised_loss.PNG]] </div><br />
<br />
As mentioned before, TimeGAN does not rely only on the binary feedback from the GAN's adversarial component, i.e. the discriminator. It also incorporates a supervised loss through the embedding and recovery functions. To ensure that the two segments of TimeGAN interact with each other, the generator is alternately fed embeddings of actual data instead of its own previously produced synthetic embeddings. Maximizing the likelihood of the next latent vector given this input produces the third objective, the supervised loss:<br />
<br />
<div align="center"> [[File:supervised_loss.PNG]] </div><br />
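<br />
As a rough illustration of the three objectives described above, the following Python sketch computes them on plain lists for a single sequence. The formulas are an assumption based on our reading of the paper (Euclidean distances for the reconstruction and supervised losses, and the log-likelihood of correct classification for the unsupervised loss); the static features are omitted for brevity, all function names are hypothetical, and this is not the authors' implementation.<br />

```python
import math


def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def reconstruction_loss(x, x_tilde):
    """Sum over time steps of ||x_t - x~_t||_2 for one sequence."""
    return sum(l2(xt, xr) for xt, xr in zip(x, x_tilde))


def unsupervised_loss(y_real, y_fake, eps=1e-12):
    """Log-likelihood of correct classification.

    y_real[t] and y_fake[t] are the discriminator's 'real' probabilities
    for real and synthetic embeddings at step t.
    """
    real_term = sum(math.log(y + eps) for y in y_real)
    fake_term = sum(math.log(1.0 - y + eps) for y in y_fake)
    return real_term + fake_term


def supervised_loss(h, h_next_pred):
    """Sum over t >= 1 of ||h_t - prediction of h_t||_2.

    h_next_pred[t - 1] is the generator's one-step-ahead prediction of h[t]
    when it is fed the real embedding h[t - 1].
    """
    return sum(l2(h[t], h_next_pred[t - 1]) for t in range(1, len(h)))
```

In this sketch the discriminator would try to make <code>unsupervised_loss</code> large while the generator would try to make it small; the other two losses are simply minimized.<br />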
<br />
=== Optimization ===<br />
The embedding and recovery components of TimeGAN are trained to minimize the supervised loss and the reconstruction loss. If <math> \theta_{e} </math> and <math> \theta_{r} </math> denote their parameters, then the paper proposes the following optimization problem for these two components:<br />
<div align="center"> [[File:Paper27_eq1.PNG]] </div><br />
Here <math>\lambda \geq 0</math> is a hyperparameter that balances the two losses. <br />
The other components of TimeGAN, i.e. the generator and the discriminator, are trained on the supervised loss together with the unsupervised loss: the generator minimizes the combined objective, while the discriminator maximizes the unsupervised loss. This optimization problem is formulated as follows:<br />
<div align="center"> [[File:Paper27_eq2.PNG]] </div> Here <math> \eta \geq 0 </math> is a hyperparameter that balances the two losses.<br />
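<br />
The two optimization problems can be sketched in Python as plain scalar combinations of the loss values. The exact form (the supervised loss weighted by <math>\lambda</math> or <math>\eta</math> and added to the other loss, with the discriminator maximizing the unsupervised loss) is our reading of the paper's equations; the function names and the default weights are hypothetical and chosen only for illustration.<br />

```python
def embedder_recovery_objective(loss_s, loss_r, lam=1.0):
    """Objective minimized over the embedding and recovery parameters."""
    if lam < 0:
        raise ValueError("lambda must be non-negative")
    return lam * loss_s + loss_r


def generator_objective(loss_s, loss_u, eta=1.0):
    """Objective minimized over the generator parameters."""
    if eta < 0:
        raise ValueError("eta must be non-negative")
    return eta * loss_s + loss_u


def discriminator_objective(loss_u):
    """The discriminator maximizes L_U, i.e. it minimizes -L_U."""
    return -loss_u
```

Setting either weight to zero removes the supervised term from the corresponding update, which is one way to see what the supervised loss contributes.<br />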
<br />
== Experiments ==<br />
In the paper, the authors compare TimeGAN with the two most familiar and related variations of traditional GANs applied to time-series i.e. RCGAN and C-RNN-GAN. To make a comparison with autoregressive approaches, the authors use RNNs trained with T-Forcing and P-Forcing. Additionally, performance comparisons are also made with WaveNet <sup>[7]</sup> and its GAN alternative WaveGAN <sup>[8]</sup>. Qualitatively, the generated data is examined in terms of diversity (healthy distribution of sample covering real data), fidelity (samples should be indistinguishable from real data), and usefulness (samples should have the same predictive purposes as real data). <br />
<br />
The following methods are used for benchmarking and evaluation:<br />
<br />
# '''Visualization''': This involves the application of t-SNE and PCA analysis on data (real and synthetic). This is done to compare the distribution of generated data with the real data in 2-D space.<br />
# '''Discriminative Score''': This involves training a post-hoc time-series classification model (an off-the-shelf RNN) to differentiate sequences from generated and original sets. <br />
# '''Predictive Score''': This involves training a post-hoc sequence prediction model on the generated data to forecast the next step of each sequence; the trained model is then evaluated on the real data.<br />
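<br />
The two quantitative scores above can be sketched as simple functions of a post-hoc model's outputs. This is a minimal sketch under stated assumptions: we assume the discriminative score is reported as the distance of the classifier's held-out accuracy from chance, and the predictive score as the mean absolute error of a predictor trained on synthetic data and tested on real data. Training of the post-hoc models is not shown, and the function names are hypothetical.<br />

```python
def discriminative_score(true_labels, predicted_labels):
    """|accuracy - 0.5| of a real-vs-synthetic classifier on held-out data.

    A score near 0 means the classifier cannot tell the two sets apart.
    """
    correct = sum(int(t == p) for t, p in zip(true_labels, predicted_labels))
    accuracy = correct / len(true_labels)
    return abs(accuracy - 0.5)


def predictive_score(real_targets, predictions):
    """Mean absolute error on real data of a model trained on synthetic data."""
    errors = [abs(r - p) for r, p in zip(real_targets, predictions)]
    return sum(errors) / len(errors)
```

For both scores, lower is better.<br />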
<br />
In the first experiment, the authors used time-series sequences generated from an autoregressive multivariate Gaussian model defined as <math>x_t=\phi x_{t-1}+n</math>, where <math>n \sim N(0, \sigma \mathbf{1} + (1-\sigma)I)</math> and <math>\mathbf{1}</math> denotes the all-ones matrix. Table 1 reports the results of this experiment for the different models. TimeGAN outperforms the other methods in terms of both discriminative and predictive scores. <br />
<br />
<div align="center"> [[File:gtable1.PNG]] </div><br />
<div align="center">'''Table 1'''</div><br />
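<br />
The autoregressive Gaussian data used in this experiment can be simulated with a few lines of Python. This is a sketch under stated assumptions: the "1" in the noise covariance is read as the all-ones matrix, <math>\sigma</math> is restricted to <math>[0, 1]</math>, and the process is started at zero; none of these details are confirmed by the summary above, and the function names are hypothetical.<br />

```python
import math
import random


def sample_noise(dim, sigma, rng):
    """Draw n ~ N(0, sigma * ones + (1 - sigma) * I) for 0 <= sigma <= 1.

    A shared standard normal gives every pair of features covariance sigma;
    an independent standard normal per feature brings each variance up to 1.
    """
    shared = rng.gauss(0.0, 1.0)
    return [
        math.sqrt(sigma) * shared + math.sqrt(1.0 - sigma) * rng.gauss(0.0, 1.0)
        for _ in range(dim)
    ]


def simulate_ar1(length, dim, phi, sigma, seed=0):
    """Simulate x_t = phi * x_{t-1} + n_t, starting from x_0 = 0 (an assumption)."""
    rng = random.Random(seed)
    x = [0.0] * dim
    sequence = []
    for _ in range(length):
        noise = sample_noise(dim, sigma, rng)
        x = [phi * xi + ni for xi, ni in zip(x, noise)]
        sequence.append(x)
    return sequence
```

Here <code>phi</code> controls the correlation across time steps and <code>sigma</code> the correlation across features; with <code>sigma=1.0</code> all features move together.<br />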
<br />
Next, the paper experiments with different types of time-series data. Using time-series sequences of varying properties, the paper evaluates the performance of TimeGAN to test its ability to generalize across time-series data. The paper uses multiple datasets from different areas, namely Sines, Stocks, Energy, and Events, with different methods (TimeGAN, RCGAN, C-RNN-GAN, etc.), and visualizes how closely the synthetic data match the original data. <br />
<br />
===Sines===<br />
They simulated multivariate sinusoidal sequences of different frequencies η and phases θ, providing continuous-valued, periodic, multivariate data where each feature is independent of others.<br />
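<br />
A minimal Python sketch of such a sinusoidal dataset is given below. It assumes each feature follows <math>x_i(t) = \sin(2\pi\eta t + \theta)</math> with <math>\eta</math> drawn uniformly from <math>[0, 1]</math> and <math>\theta</math> uniformly from <math>[-\pi, \pi]</math>; these sampling ranges are our recollection of the paper rather than something stated above, and the function name is hypothetical.<br />

```python
import math
import random


def simulate_sines(length, dim, seed=0):
    """Return a length x dim sequence of independent sinusoidal features."""
    rng = random.Random(seed)
    # One (frequency, phase) pair per feature, drawn independently.
    params = [
        (rng.uniform(0.0, 1.0), rng.uniform(-math.pi, math.pi))
        for _ in range(dim)
    ]
    return [
        [math.sin(2.0 * math.pi * eta * t + theta) for eta, theta in params]
        for t in range(length)
    ]
```

Because every feature has its own frequency and phase, the features are independent of one another, which is the property this dataset is meant to test.<br />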
<br />
===Stocks===<br />
By contrast, sequences of stock prices are continuous-valued but aperiodic; furthermore, features are correlated with each other. They use the daily historical Google stocks data from 2004 to 2019, including as features the volume and high, low, opening, closing, and adjusted closing prices.<br />
<br />
===Energy===<br />
They consider a dataset characterized by noisy periodicity, higher dimensionality, and correlated features. The UCI Appliances energy prediction dataset consists of multivariate, continuous-valued measurements including numerous temporal features measured at close intervals.<br />
<br />
===Events===<br />
Finally, they considered a dataset characterized by discrete values and irregular time stamps. They used a large private lung cancer pathways dataset consisting of sequences of events and their times, and modelled both the one-hot encoded sequence of event types and the event timings.<br />
<br />
Figure 2 shows t-SNE/PCA visualization comparison for Sines and Stocks and it is clear from the figure that among all different models, TimeGAN shows the best overlap between generated and original data.<br />
<br />
<div align="center"> [[File:pca.PNG]] </div><br />
<div align="center">'''Figure 2'''</div><br />
<br />
Table 2 compares the predictive and discriminative scores of the different methods across the datasets. TimeGAN outperforms the other methods on both scores, indicating better-quality synthetic data across different types of datasets. <br />
<br />
<div align="center"> [[File:gtable2.PNG]] </div><br />
<div align="center">'''Table 2'''</div><br />
<br />
== Source Code ==<br />
<br />
The GitHub repository for the paper is https://github.com/jsyoon0823/TimeGAN .<br />
<br />
== Conclusion ==<br />
Traditional models such as ARMA and ARIMA can be used to analyze time-series data. Longitudinal data can also be modelled with generalized additive models or linear mixed-effects models, which support both feature selection and prediction. The authors of this paper introduce an advanced architecture for generating new time-series data that combines the flexibility of unsupervised learning with the control over the quality of the generated data provided by supervised learning.<br />
<br />
Combining the flexibility of GANs and control over conditional temporal dynamics of autoregressive models, TimeGAN shows significant quantitative and qualitative gains for generated time-series data across different varieties of datasets. <br />
<br />
The authors indicated the potential incorporation of Differential Privacy Frameworks into TimeGAN in the future in order to produce real-time sequences with differential privacy guarantees.<br />
<br />
== Critique ==<br />
The method introduced in this paper is truly a novel one. The idea of enhancing the unsupervised components of a GAN with a supervised element yields significant improvements in certain evaluations. I think the evaluation methods used in this paper, namely t-SNE/PCA analysis (visualization), the discriminative score, and the predictive score, are very appropriate for this sort of analysis, where the focus is on several properties (generative accuracy and conditional dependence) assessed both quantitatively and qualitatively. Other related works <sup>[9]</sup> have also used the same evaluation setup.<br />
<br />
The idea of the synthesized time-series being useful in terms of its predictive ability is good, especially in practice. But I think when the authors set out to create a model that can learn the temporal dynamics between time-series data then there could have been some additional metric that could better evaluate if the underlying temporal relations have been captured by the model or not. I feel the addition of some form of temporal correlation analysis would have added to the completeness of the paper.<br />
<br />
The enhancement of traditional GAN by simply adding an extra loss function to the mix is quite elegant. TimeGAN uses a stepwise supervised loss. The authors have also used very common choices for the various components of the overall TimeGAN network. This leaves a lot of possibilities in this area as many direct and indirect variations of TimeGAN or other architectures inspired by TimeGAN can be developed in a very straightforward manner of hyper-parameterizing the building blocks. <br />
<br />
TimeGAN benefits from merging supervised and unsupervised learning to produce its generated sequences, while other methods in the literature rely on learning from their conditional input. Even with this combination of supervised and unsupervised learning, I believe the way the authors introduce temporal embeddings to assist network training is not well suited to anomaly (outlier) detection, since it is designed only for time-series representation learning, as discussed in [10].<br />
<br />
The paper certainly proposes a novel approach to analyzing time series data, but there are concerns about the way the model is tested in practice. First, if the data is generated from a <math>VAR(1)</math> model, why did the authors not use a multi-dimensional auto-ARIMA procedure, or a Box-Jenkins approach, to fit a model to their synthetic dataset? Moreover, as has been studied in the M4 competitions (see e.g. https://www.sciencedirect.com/science/article/pii/S0169207019301128), the ability of complex ML or deep learning models to beat linear models in general is questionable. A theoretical reason for this empirical finding is the Wold decomposition theorem, which says that a stationary process can be decomposed into the sum of a deterministic process and a linear process; this gives a lot of credence to the ARIMA model. It would be highly beneficial if the authors included the Box-Jenkins benchmark in their experiments, as well as testing their model against real data to see if it actually performs well.<br />
<br />
== References ==<br />
<br />
[1] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.<br />
<br />
[2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.<br />
<br />
[3] Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609, 2016.<br />
<br />
[4] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.<br />
<br />
[5] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.<br />
<br />
[6] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.<br />
<br />
[7] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016.<br />
<br />
[8] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018.<br />
<br />
[9] Hao Ni, L. Szpruch, M. Wiese, S. Liao, and Baoren Xiao. Conditional Sig-Wasserstein GANs for time series generation, 2020.<br />
<br />
[10] Alexander Geiger et al. TadGAN: Time series anomaly detection using generative adversarial networks. arXiv preprint arXiv:2009.07769, 2020.<br />
<br />
[11] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.<br />
<br />
[12] Makridakis, Spyros, Evangelos Spiliotis, and Vassilios Assimakopoulos. "The M4 Competition: 100,000 time series and 61 forecasting methods." International Journal of Forecasting 36.1 (2020): 54-74.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data&diff=49382Learning The Difference That Makes A Difference With Counterfactually-Augmented Data2020-12-06T12:15:37Z<p>A227jain: Updated introduction details and improved grammatical issues</p>
<hr />
<div>== Presented by == <br />
Syed Saad Naseem<br />
<br />
== Introduction == <br />
This paper addresses the problem of building models for NLP tasks that are robust against spurious correlations in the data. The authors tackle this problem with a human-in-the-loop method in which human annotators are hired to revise each document so that it represents the opposite label. For example, if a text has a positive sentiment, the annotators change it so that it expresses a negative sentiment while making minimal changes to the text. They refer to this process as counterfactual augmentation. The authors apply this method to the IMDb sentiment dataset and to SNLI, and show that many models trained only on the original dataset perform poorly on the revised dataset, and vice versa. The idea behind the human-in-the-loop system is that, by intervening only upon the factor of interest, the revisions might disentangle the spurious and non-spurious associations, yielding classifiers that hold up better when spurious associations do not transport out of the domain. <br />
<br />
== Background == <br />
'''What are spurious patterns in NLP, and why do they occur?'''<br />
<br />
Current supervised machine learning systems try to learn the underlying features of input data that associate the inputs with the corresponding labels. Take Twitter sentiment analysis as an example: there might be many negative tweets about Donald Trump. If we use those tweets as training data, an ML system tends to associate the word "Trump" with the label Negative, even though the word itself is completely neutral. The association between the word "Trump" and the label Negative is spurious. One way to explain why this occurs is that association does not necessarily mean causation. For example, the color gold might be associated with success, but it does not cause success. Current ML systems might learn such undesired associations and then make deductions from them. This is typically caused by an inherent bias within the data, which the ML models learn and which leads to biased predictions.<br />
<br />
== Data Collection ==<br />
The authors used Amazon’s Mechanical Turk, a crowdsourcing platform, to recruit editors. They hired these editors to revise each document. <br />
<br />
'''Sentiment Analysis'''<br />
<br />
The dataset to be analyzed is the IMDb movie review dataset. The annotators were directed to revise the reviews to make them counterfactual, without making any gratuitous changes. There are several types of changes that were applied and two examples are listed below, where red represents original text and blue represents modified text.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Type of Change !! Original Review !! Modified Review<br />
|-<br />
| Change ratings || one of the worst ever scenes in a sports movie. <span style="color:red">3 stars out of 10</span>. || one of the wildest ever scenes in a sports movie. <span style="color:blue">8 stars out of 10</span>.<br />
|-<br />
| Suggest sarcasm || thoroughly captivating <span style="color:red">thriller-drama, taking a deep and realistic</span> view. || thoroughly mind numbing <span style="color:blue">“thriller-drama”, taking a “deep” and “realistic” (who are they kidding?)</span> view.<br />
|}<br />
<br />
'''Natural Language Inference'''<br />
<br />
NLI is a 3-class classification task, where the inputs are a premise and a hypothesis. Given the inputs, the model predicts a label that describes the relationship between the facts stated in each sentence. The labels can be entailment, contradiction, or neutral. The annotators were asked to modify the premise of the text while keeping the hypothesis intact, and vice versa. Some examples of modifications are given below, with labels given in parentheses.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Premise !! Original Hypothesis !! Modified Hypothesis<br />
|-<br />
| A young dark-haired woman crouches on the banks of a river while washing dishes. || A woman washes dishes in the river <span style="color:red">while camping</span> (Neutral) || A woman washes dishes <span style="color:blue">in the river</span>. (Entailment)<br />
|-<br />
| Students are inside of a lecture hall || Students are <span style="color:red">indoors</span>. (Entailment) || Students are <span style="color:blue">on the soccer field</span>. (Contradiction)<br />
|-<br />
| An older man with glasses raises his eyebrows in surprise. || The man <span style="color:red">has no glasses</span>. (Contradiction) || The man <span style="color:blue">wears bifocals</span>. (Neutral)<br />
|}<br />
<br />
After the data collection, a different set of workers was employed to verify whether the given label accurately described the relationship between each premise-hypothesis pair. Each pair was presented to 3 workers and the pair was only accepted if all 3 of the workers approved that the text is accurate. This entire process cost the authors about $10778.<br />
<br />
== Example ==<br />
In the picture below, we can see an example of spurious correlation and how the method presented here can address it. The picture shows the most important features learned by an SVM. As we can see in the left plot, when the model is trained only on the original data, the word "horror" is associated with the negative label and the word "romantic" is associated with the positive label. This is an example of spurious correlation because we can certainly have both bad romantic movies and good horror movies. The middle plot shows the case where the model is trained only on the revised dataset. As expected, the situation is reversed: "horror" and "romantic" are associated with the positive and negative labels respectively. The problem is solved in the right plot, where the authors trained the model on both the original and the revised datasets. The words "horror" and "romantic" are no longer among the most important features, which is what we wanted.<br />
<br />
[[File: SVM features.png | center |800px]]<br />
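<br />
The effect shown in the picture can be mimicked on a toy scale without an SVM. The sketch below swaps the SVM for a simple count-based association score, and the four reviews are hypothetical examples invented for illustration only; they are not taken from the paper's dataset.<br />

```python
from collections import Counter


def label_association(docs):
    """Score each word by (pos_docs - neg_docs) / (pos_docs + neg_docs).

    +1 means the word only appears in positive documents, -1 only in
    negative ones, and 0 means it is equally common in both.
    """
    pos, neg = Counter(), Counter()
    for text, label in docs:
        words = set(text.lower().split())
        (pos if label == "positive" else neg).update(words)
    return {w: (pos[w] - neg[w]) / (pos[w] + neg[w]) for w in set(pos) | set(neg)}


# Hypothetical toy reviews, invented for illustration only.
original = [
    ("a dull horror film", "negative"),
    ("a lovely romantic film", "positive"),
]
# Minimal counterfactual edits: the label flips, the genre word stays.
revised = [
    ("a gripping horror film", "positive"),
    ("a tedious romantic film", "negative"),
]

orig_scores = label_association(original)
combined_scores = label_association(original + revised)
```

On the original toy data the genre words are perfectly predictive of the label, while on the combined data their scores fall to zero and only the words that actually carry sentiment (such as "dull" or "gripping") remain informative.<br />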
<br />
== Experiments ==<br />
===Sentiment Analysis===<br />
The authors carried out experiments on a total of 5 models: Support Vector Machines (SVMs), Naive Bayes (NB) classifiers, Bidirectional Long Short-Term Memory networks (Bi-LSTMs), ELMo models with LSTM, and fine-tuned BERT models. Furthermore, they evaluated their models on Amazon reviews datasets aggregated over six genres, on a Twitter sentiment dataset, and on Yelp reviews released as part of a Yelp dataset challenge. They showed that, in almost all cases, models trained on the counterfactually-augmented IMDb dataset perform better than models trained on comparable quantities of original data, as shown in the table below.<br />
<br />
[[File:result1_syed.PNG]]<br />
<br />
===Natural Language Inference===<br />
<br />
To evaluate the BERT model on the SNLI task, the authors used different combinations of training and evaluation sets. BERT fine-tuned on the original data (1.67k) performs well on the original evaluation set; however, its accuracy drops from 72.2% to 39.7% when evaluated on the RP (Revised Premise) set. The same holds even with the full original set (500k): the accuracy of the model drops significantly on the RP, RH (Revised Hypothesis), and RP&RH datasets. Table 7 shows that the BERT model fine-tuned on a combination of RP and RH achieves consistent performance on all datasets.<br />
<br />
[[File:NLI.png|center]]<br />
== Source Code ==<br />
<br />
The official code is available at https://github.com/acmi-lab/counterfactually-augmented-data .<br />
== Conclusion ==<br />
<br />
The authors propose a new way to augment textual datasets for the task of sentiment analysis, which helps learning methods generalize better by concentrating on the difference that makes a difference. I believe that the main contribution of the paper is the introduction of the idea of counterfactual datasets for sentiment analysis. The paper proposes an interesting approach to tackling NLP problems, shows intriguing experimental results, and presents an interesting dataset that may be useful for future research. Indeed, this work has been cited in several interesting works examining gender bias in NLP [1], making AI programs more ethical [2], and generating humor text [3].<br />
<br />
== References ==<br />
<br />
[1] Lu, K., Mardziel, P., Wu, F., Amancharla, P., & Datta, A. (2018). Gender Bias in Neural Natural Language Processing.<br />
<br />
[2] Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., & Steinhardt, J. (2020). Aligning AI With Shared Human Values. 1–22.<br />
<br />
[3] Weller, O., Fulda, N., & Seppi, K. (2020). Can Humor Prediction Datasets be used for Humor Generation? Humorous Headline Generation via Style Transfer. 186–191.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT&diff=49381BERTScore: Evaluating Text Generation with BERT2020-12-06T12:02:57Z<p>A227jain: </p>
<hr />
<div>== Presented by == <br />
Gursimran Singh<br />
<br />
== Introduction == <br />
In recent times, various machine learning approaches for text generation have gained popularity. This paper aims to develop an automatic metric that judges the quality of the generated text. Commonly used state-of-the-art metrics use either n-gram approaches or word embeddings to calculate the similarity between the reference and the candidate sentence. BERTScore, on the other hand, calculates the similarity using contextual embeddings. BERTScore addresses two common pitfalls of n-gram-based metrics. Firstly, n-gram models fail to robustly match paraphrases, which leads to performance underestimation when semantically correct phrases are penalized because they differ from the surface form of the reference. In BERTScore, by contrast, the similarity is computed using contextualized token embeddings, which have been shown to be effective for paraphrase detection. Secondly, n-gram models fail to capture distant dependencies and to penalize semantically critical ordering changes. In contrast, contextualized embeddings capture distant dependencies and ordering effectively. The authors of the paper have carried out various experiments in Machine Translation and Image Captioning to show why BERTScore is more reliable and robust than the previous approaches.<br />
<br />
''' Word versus Context Embeddings '''<br />
<br />
Both approaches aim to reduce the sparseness of a bag-of-words (BoW) representation of text, which is caused by high-dimensional vocabularies. Both create embeddings of much lower dimensionality than sparse BoW and aim to capture semantics and context. Word embeddings are deterministic in the sense that a given word embedding model always produces the same embedding for a word, regardless of the surrounding words. Contextual embeddings, in contrast, produce different embeddings for a word depending on the surrounding words in the given text.<br />
<br />
== Previous Work ==<br />
Previous approaches for evaluating text generation can be broadly divided into several categories. The most commonly used techniques are based on n-gram matching. The main objective here is to compare the n-grams in the reference and candidate sentences and thus analyze the ordering of words in the sentences. <br />
The most popular n-gram matching metric is BLEU. It follows the underlying principle of n-gram matching and is distinguished by three main factors. <br><br />
• Each n-Gram is matched at most once. <br><br />
• The total of exact-matches is accumulated for all reference candidate pairs and divided by the total number of <math>n</math>-grams in all candidate sentences. <br><br />
• Very short candidates are restricted. <br><br />
<br />
Further, BLEU is generally calculated for multiple values of <math>n</math> and the results are averaged geometrically.<br />
Other n-gram approaches include METEOR, NIST, ΔBLEU, etc.<br />
METEOR (Banerjee & Lavie, 2005) computes Exact- <math> P_1 </math> and Exact- <math> R_1 </math> with the modification that, when exact unigram matching is not possible, matching to word stems, synonyms, and paraphrases is used instead. For example, ''running'' may be matched with ''run'' if no exact match is found. This non-exact matching is done using external tools such as a paraphrase table. In newer versions of METEOR, an external paraphrase resource is used and different weights are assigned to different matching types. <br />
<br />
Most of these methods utilize or slightly modify the exact match precision (Exact-<math>P_n</math>) and recall (Exact-<math>R_n</math>) scores. These scores can be formalized as follows:<br />
<br />
<div align="center">Exact- <math> P_n = \frac{\sum_{w \in S^{n}_{\hat{x}}} \mathbb{I}[w \in S^{n}_{x}]}{\left| S^{n}_{\hat{x}} \right|} </math> </div><br />
<br />
<div align="center">Exact- <math> R_n = \frac{\sum_{w \in S^{n}_{x}} \mathbb{I}[w \in S^{n}_{\hat{x}}]}{\left| S^{n}_{x} \right|} </math> </div><br />
<br />
Here <math>S^{n}_{x}</math> and <math>S^{n}_{\hat{x}}</math> are lists of token <math>n</math>-grams in the reference <math>x</math> and candidate <math>\hat{x}</math> sentences respectively.<br />
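<br />
The exact-match precision and recall defined above can be computed directly. The following Python sketch uses whitespace tokenization, which is a simplifying assumption, and hypothetical function names.<br />

```python
def ngrams(tokens, n):
    """Return the list of token n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def exact_precision_recall(reference, candidate, n=1):
    """Exact-P_n and Exact-R_n for a reference and a candidate sentence."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    ref_set, cand_set = set(ref), set(cand)
    # Precision: fraction of candidate n-grams found in the reference.
    precision = sum(w in ref_set for w in cand) / len(cand) if cand else 0.0
    # Recall: fraction of reference n-grams found in the candidate.
    recall = sum(w in cand_set for w in ref) / len(ref) if ref else 0.0
    return precision, recall
```

For the reference "people like foreign cars", the candidate "people like visiting places abroad" shares two unigrams with the reference, whereas the paraphrase "consumers prefer imported cars" shares only one, so exact matching rewards the first candidate more.<br />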
<br />
Other categories include edit-distance-based metrics, which compare two strings by calculating the minimum number of operations needed to transform one into the other; embedding-based metrics, which compare the strings in an embedding space; and learned metrics, which construct task-specific metrics by applying machine learning to a supervised dataset. Most of these techniques do not capture the context of a word in the sentence. Moreover, learned metric approaches require costly human judgements as supervision for each dataset.<br />
<br />
== Motivation ==<br />
The <math>n</math>-gram approaches like BLEU do not capture the positioning and the context of the word and simply rely on exact matching for evaluation. Consider the following example that shows how BLEU can result in incorrect judgment. <br><br />
Reference: people like foreign cars <br><br />
Candidate 1: people like visiting places abroad <br><br />
Candidate 2: consumers prefer imported cars<br />
<br />
BLEU gives a higher score to Candidate 1 as compared to Candidate 2. This undermines the performance of text generation models since contextually correct sentences are penalized. In contrast, some semantically different phrases are scored higher just because they are closer to the surface form of the reference sentence. <br />
<br />
On the other hand, BERTScore computes similarity using contextual token embeddings. This helps in detecting semantically correct paraphrased sentences. It also captures cause-and-effect ordering (e.g. "A gives B" versus "B gives A"), which BLEU fails to detect.<br />
<br />
== BERTScore Architecture ==<br />
Fig 1 summarizes the steps for calculating the BERTScore. Next, we will see details about each step. Here, the reference sentence is given by <math> x = \langle x_1, \ldots, x_k \rangle </math> and the candidate sentence by <math> \hat{x} = \langle \hat{x}_1, \ldots, \hat{x}_l \rangle. </math> <br><br />
<br />
<div align="center"> [[File:Architecture_BERTScore.PNG|Illustration of the computation of BERTScore.]] </div><br />
<div align="center">'''Fig 1'''</div><br />
<br />
=== Token Representation ===<br />
Reference and candidate sentences are represented using contextual embeddings. Word embedding techniques inspire this but in contrast to word embeddings, the contextual embedding of a word depends upon the surrounding words in the sentence. These contextual embeddings are calculated using BERT and other similar models which utilize self-attention and nonlinear transformations.<br />
<br />
<div align="center"> [[File:Pearsson_corr_contextual_emb.PNG|Pearson Correlation for Contextual Embedding]] </div><br />
<div align="center">'''Fig 2'''</div><br />
<br />
=== Cosine Similarity ===<br />
Pairwise cosine similarity is calculated between each token <math> x_{i} </math> in the reference sentence and each token <math> \hat{x}_{j} </math> in the candidate sentence. Prenormalized vectors are used, so the pairwise similarity reduces to the inner product <math> x_{i}^\top \hat{x}_{j}. </math><br />
<br />
=== BERTScore ===<br />
<br />
Each token in <math> x </math> is matched to the most similar token in <math> \hat{x} </math> to compute Recall, and each token in <math> \hat{x} </math> is matched to the most similar token in <math> x </math> to compute Precision. The matching is greedy and isolated. Precision and Recall are then combined into the F1 score. The equations for Precision, Recall, and F1 score are as follows:<br />
<br />
<div align="center"> [[File:Equations.PNG|Equations for the calculation of BERTScore.]] </div><br />
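<br />
The greedy matching can be sketched in a few lines of Python. The sketch assumes the token embeddings are already unit-length vectors, so the inner product equals the cosine similarity; in practice they would come from a model such as BERT, whereas the vectors in the usage note are hypothetical toy values. The function names are also hypothetical.<br />

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bert_score(ref_emb, cand_emb):
    """Greedy-matching precision, recall and F1 from unit-length token vectors."""
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(dot(x, y) for y in cand_emb) for x in ref_emb) / len(ref_emb)
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(dot(y, x) for x in ref_emb) for y in cand_emb) / len(cand_emb)
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom != 0 else 0.0
    return precision, recall, f1
```

For example, with the toy reference embeddings <code>[[1.0, 0.0], [0.0, 1.0]]</code> and the single candidate embedding <code>[[1.0, 0.0]]</code>, the candidate token is matched perfectly (precision 1.0), but only one of the two reference tokens is covered (recall 0.5).<br />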
<br />
<br />
=== Importance Weighting (optional) ===<br />
In some cases, rare words can be highly indicative of sentence similarity. Therefore, Inverse Document Frequency (idf) weights can be used with the above equations of the BERTScore. This is optional and, depending on the domain of the text and the available data, it may or may not benefit the final results. Understanding more about importance weighting is thus an open area of research.<br />
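<br />
A minimal Python sketch of idf-weighted recall is shown below. It assumes that idf scores are computed from the reference sentences of the test corpus with a plus-one smoothing of the form <math>\log\frac{M+1}{\mathrm{df}(w)+1}</math>, and that each reference token's best-match similarity is weighted by its idf; the smoothing form, the fallback for zero total weight, and the function names are assumptions rather than details given above.<br />

```python
import math


def idf_table(reference_corpus):
    """Return (idf dict, default idf for unseen tokens).

    reference_corpus is a list of tokenized reference sentences.
    """
    m = len(reference_corpus)
    sentence_sets = [set(sentence) for sentence in reference_corpus]
    vocab = set().union(*sentence_sets) if sentence_sets else set()
    table = {}
    for token in vocab:
        doc_freq = sum(token in s for s in sentence_sets)
        table[token] = math.log((m + 1) / (doc_freq + 1))
    return table, math.log(m + 1)


def weighted_recall(ref_tokens, best_similarities, idf, default_idf):
    """idf-weighted average of each reference token's best-match similarity."""
    weights = [idf.get(token, default_idf) for token in ref_tokens]
    total = sum(weights)
    if total == 0:
        # All tokens are maximally common: fall back to the unweighted mean.
        return sum(best_similarities) / len(best_similarities)
    return sum(w * s for w, s in zip(weights, best_similarities)) / total
```

With this weighting, a token that occurs in every reference sentence contributes nothing to the score, while rare tokens dominate it.<br />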
<br />
=== Baseline Rescaling ===<br />
Rescaling is done only to increase the human readability of the score. In theory, cosine similarity values are between -1 and 1 but practically they are confined in a much smaller range. A value b computed using Common Crawl monolingual datasets is used to linearly rescale the BERTScore. The rescaled recall <math> \hat{R}_{BERT} </math> is given by<br />
<div align="center"> [[File:Equation2.PNG|Equation for the rescaled BERTScore.]] </div><br />
Similarly, <math> P_{BERT} </math> and <math> F_{BERT} </math> are rescaled as well.<br />
<br />
=== Experiment & Results ===<br />
The authors experimented with different pre-trained contextual embedding models such as BERT and RoBERTa, and reported the results of the best-performing model. In addition to the standard evaluation, they also designed model selection experiments: they use 10K hybrid systems super-sampled from WMT18, randomly select 100 of them, and rank these using the automatic metrics. The evaluation was carried out on Machine Translation and Image Captioning tasks. <br />
<br />
=== Machine Translation ===<br />
The metric evaluation dataset consists of 149 translation systems, gold references, and two types of human judgments: segment-level and system-level. The former assigns a score to each reference-candidate pair, and the latter assigns a single score to the whole system. Segment-level outputs for BERTScore are calculated as explained in the previous section on architecture, and system-level outputs are calculated by averaging the BERTScore over every reference-candidate pair. Absolute Pearson correlation <math> \lvert \rho \rvert </math> and Kendall rank correlation <math> \tau </math> are used to measure metric quality, with the Williams test <sup> [1] </sup> for the significance of <math> \lvert \rho \rvert </math> and bootstrap resampling, as suggested by Graham & Baldwin <sup> [2] </sup>, for <math> \tau </math>. The authors also created hybrid systems by randomly sampling, for each reference sentence, one candidate sentence from one of the systems. This increases the number of systems available for system-level experiments. Further, the authors randomly selected 100 systems out of the 10K hybrid systems and ranked them using the automatic metrics. They repeated this process multiple times and reported Hits@1, the percentage of trials in which the metric's top-ranked system agrees with the human ranking of the best system. <br />
<br />
<div align="center"> '''The following 4 tables show the result of the experiments mentioned above.''' </div> <br><br />
<br />
<div align="center"> [[File:Table1_BERTScore.PNG|700px| Table1 Machine Translation]] [[File:Table2_BERTScore.PNG|700px| Table2 Machine Translation]] </div><br />
<div align="center"> [[File:Table3_BERTScore.PNG|700px| Table3 Machine Translation]] [[File:Table4_BERTScore.PNG|700px| Table4 Machine Translation]] </div><br />
<br />
In all four tables, BERTScore is consistently among the top performers. It also gives a large improvement over the widely used BLEU metric. In to-English translation, RUSE shows competitive results, but it is a learned metric and requires costly human judgments as supervision.<br />
<br />
=== Image Captioning ===<br />
For Image Captioning, human judgments for 12 submission entries from the COCO 2015 Captioning Challenge are used. Following Cui et al. (2018) <sup> [3] </sup>, the Pearson correlation with two system-level metrics is calculated. The metrics are the percentage of captions that are better than or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). Each image has approximately five reference captions, and the BERTScore of a candidate is taken to be the maximum of its BERTScores computed against each reference caption individually. BERTScore is compared with 8 task-agnostic metrics and 2 task-specific metrics. <br />
<br />
<div align="center"> [[File:Table5_BERTScore.PNG|450px| Table5 Image Captioning]] </div><br />
<br />
<div align="center"> '''Table 5: Pearson correlation on the 2015 COCO Captioning Challenge.''' </div><br />
<br />
BERTScore is again a top performer, while n-gram metrics like BLEU show a weak correlation with human judgments. For this task, importance weighting yields a significant improvement, indicating the importance of content words. <br />
<br />
'''Speed:''' BERTScore is slower than BLEU, but the absolute cost remains small. For example, on the same hardware, the Machine Translation test takes 15.6 seconds with BERTScore compared to 5.4 seconds with BLEU. Since both run times are on the order of seconds, the difference is marginal in practice.<br />
<br />
== Robustness Analysis ==<br />
The authors tested BERTScore's robustness using two adversarial paraphrase classification datasets, QQP and PAWS. The table below summarizes the results. Most metrics perform well on QQP, but their performance drops significantly on PAWS. In contrast, BERTScore performs competitively on PAWS, which suggests that it is better at distinguishing harder adversarial examples.<br />
<br />
<div align="center"> [[File: bertscore.png | 500px]] </div><br />
<br />
== Source Code == <br />
The code for this paper is available at [https://github.com/Tiiiger/bert_score BERTScore].<br />
<br />
== Critique & Future Prospects==<br />
BERTScore is a text evaluation metric that outperforms previous approaches because it uses contextual embeddings for evaluation. It is simple and easy to use. BERTScore is also more robust than previous approaches, as shown by the experiments carried out on datasets of paraphrased sentences. There are variants of BERTScore depending on the contextual embedding model, the use of importance weighting, and the evaluation metric (Precision, Recall, or F1 score). <br />
<br />
The main reason behind the success of BERTScore is the use of contextual embeddings. The remaining architecture is straightforward in itself. Some word-embedding-based methods use more complex metrics for calculating similarity. Combining those metrics with contextual embeddings instead of word embeddings might yield more reliable performance than BERTScore.<br />
<br />
<br />
The paper is interesting, but the proposed approach arguably lacks technical novelty. The method is a natural application of BERT combined with traditional cosine similarity measures, precision-, recall-, and F1-based computations, and simple IDF-based importance weighting.<br />
<br />
== References ==<br />
<br />
[1] Evan James Williams. Regression analysis. Wiley, 1959.<br />
<br />
[2] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In EMNLP, 2014.<br />
<br />
[3] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. Learning to evaluate image captioning. In CVPR, 2018.<br />
<br />
[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.<br />
<br />
[5] Qingsong Ma, Ondrej Bojar, and Yvette Graham. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In WMT, 2018.<br />
<br />
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.<br />
<br />
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019b.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT&diff=49380BERTScore: Evaluating Text Generation with BERT2020-12-06T12:01:03Z<p>A227jain: Added important details on Experiments & Results</p>
<hr />
<div>== Presented by == <br />
Gursimran Singh<br />
<br />
== Introduction == <br />
In recent times, various machine learning approaches for text generation have gained popularity. This paper aims to develop an automatic metric that judges the quality of generated text. Commonly used state-of-the-art metrics use either n-gram approaches or word embeddings to calculate the similarity between the reference and the candidate sentence. BERTScore, on the other hand, calculates the similarity using contextual embeddings. It addresses two common pitfalls of n-gram-based metrics. Firstly, n-gram models fail to robustly match paraphrases, which leads to performance underestimation when semantically correct phrases are penalized because they differ from the surface form of the reference. In BERTScore, the similarity is instead computed using contextualized token embeddings, which have been shown to be effective for paraphrase detection. Secondly, n-gram models fail to capture distant dependencies and to penalize semantically critical ordering changes. In contrast, contextualized embeddings capture distant dependencies and ordering effectively. The authors carried out various experiments in Machine Translation and Image Captioning to show why BERTScore is more reliable and robust than previous approaches.<br />
<br />
''' Word versus Context Embeddings '''<br />
<br />
Both kinds of embedding aim to reduce the sparseness of a bag-of-words (BoW) representation of text, which arises from high-dimensional vocabularies. Both create embeddings of much lower dimensionality than sparse BoW vectors and aim to capture semantics and context. They differ in that a word embedding is static: a given word embedding model always produces the same embedding for a word, regardless of the surrounding words. Contextual embeddings, in contrast, produce different embeddings for a word depending on the surrounding words in the given text.<br />
<br />
== Previous Work ==<br />
Previous approaches for evaluating text generation can be broadly divided into several categories. The most commonly used techniques are based on n-gram matching. The main objective here is to compare the n-grams in the reference and candidate sentences and thus analyze the ordering of words in the sentences. <br />
The most popular n-gram matching metric is BLEU. It follows the underlying principle of n-gram matching and is distinguished by three main factors. <br><br />
• Each n-gram is matched at most once. <br><br />
• The total number of exact matches is accumulated over all reference-candidate pairs and divided by the total number of <math>n</math>-grams in all candidate sentences. <br><br />
• Very short candidates are penalized. <br><br />
<br />
Further, BLEU is generally calculated for multiple values of <math>n</math> and averaged geometrically.<br />
Other n-gram approaches include METEOR, NIST, and ΔBLEU.<br />
METEOR (Banerjee & Lavie, 2005) computes Exact- <math> P_1 </math> and Exact- <math> R_1 </math> with the modification that, when exact unigram matching is not possible, matching to word stems, synonyms, and paraphrases is used instead. For example, ''running'' may be matched with ''run'' if no exact match is found. This non-exact matching relies on external tools such as a paraphrase table. Newer versions of METEOR also assign different weights to different matching types. <br />
<br />
Most of these methods utilize or slightly modify the exact match precision (Exact-<math>P_n</math>) and recall (Exact-<math>R_n</math>) scores. These scores can be formalized as follows:<br />
<br />
<div align="center">Exact- <math> P_n = \frac{\sum_{w \in S^{n}_{\hat{x}}} \mathbb{I}[w \in S^{n}_{x}]}{\lvert S^{n}_{\hat{x}} \rvert} </math> </div><br />
<br />
<div align="center">Exact- <math> R_n = \frac{\sum_{w \in S^{n}_{x}} \mathbb{I}[w \in S^{n}_{\hat{x}}]}{\lvert S^{n}_{x} \rvert} </math> </div><br />
<br />
Here <math>S^{n}_{x}</math> and <math>S^{n}_{\hat{x}}</math> are lists of token <math>n</math>-grams in the reference <math>x</math> and candidate <math>\hat{x}</math> sentences respectively.<br />
<br />
Other categories include edit-distance-based metrics, which compare two strings by calculating the minimum number of operations needed to transform one into the other; embedding-based metrics, which are derived by mapping the strings into an embedding space; and learned metrics, which construct task-specific metrics by applying machine learning to a supervised dataset. Most of these techniques do not capture the context of a word in the sentence. Moreover, learned metrics require costly human judgments as supervision for each dataset.<br />
<br />
== Motivation ==<br />
The <math>n</math>-gram approaches like BLEU do not capture the positioning and the context of the word and simply rely on exact matching for evaluation. Consider the following example that shows how BLEU can result in incorrect judgment. <br><br />
Reference: people like foreign cars <br><br />
Candidate 1: people like visiting places abroad <br><br />
Candidate 2: consumers prefer imported cars<br />
<br />
BLEU gives a higher score to Candidate 1 as compared to Candidate 2. This undermines the performance of text generation models since contextually correct sentences are penalized. In contrast, some semantically different phrases are scored higher just because they are closer to the surface form of the reference sentence. <br />
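<br />
The effect can be illustrated with the exact-match unigram scores from the Previous Work section. The following is a minimal sketch rather than a full BLEU implementation: it omits clipping of repeated matches, higher-order n-grams, and the penalty for short candidates, and the function names are chosen here for illustration only.<br />
<br />
```python
def ngrams(tokens, n):
    """Return the list of token n-grams (as tuples) in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def exact_precision_recall(reference, candidate, n):
    """Exact-P_n and Exact-R_n as defined in the Previous Work section."""
    ref_ngrams = ngrams(reference, n)
    cand_ngrams = ngrams(candidate, n)
    ref_set, cand_set = set(ref_ngrams), set(cand_ngrams)
    # Precision: share of candidate n-grams that also occur in the reference.
    precision = sum(w in ref_set for w in cand_ngrams) / len(cand_ngrams) if cand_ngrams else 0.0
    # Recall: share of reference n-grams that also occur in the candidate.
    recall = sum(w in cand_set for w in ref_ngrams) / len(ref_ngrams) if ref_ngrams else 0.0
    return precision, recall


reference = "people like foreign cars".split()
candidate_1 = "people like visiting places abroad".split()
candidate_2 = "consumers prefer imported cars".split()

p1, r1 = exact_precision_recall(reference, candidate_1, 1)
p2, r2 = exact_precision_recall(reference, candidate_2, 1)
print(p1, r1)  # unigram scores for Candidate 1
print(p2, r2)  # unigram scores for Candidate 2
```
<br />
Candidate 1 shares two unigrams with the reference while Candidate 2 shares only one, so the surface-level scores favor Candidate 1 even though Candidate 2 is the closer paraphrase.<br />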
<br />
On the other hand, BERTScore computes similarity using contextual token embeddings. It helps in detecting semantically correct paraphrased sentences. It also captures cause-and-effect relationships (A gives B in place of B gives A), which BLEU does not detect.<br />
<br />
== BERTScore Architecture ==<br />
Fig 1 summarizes the steps for calculating the BERTScore; each step is detailed below. Here, the reference sentence is given by <math> x = \langle x_1, \dots, x_k \rangle </math> and the candidate sentence by <math> \hat{x} = \langle \hat{x}_1, \dots, \hat{x}_l \rangle. </math> <br><br />
<br />
<div align="center"> [[File:Architecture_BERTScore.PNG|Illustration of the computation of BERTScore.]] </div><br />
<div align="center">'''Fig 1'''</div><br />
<br />
=== Token Representation ===<br />
Reference and candidate sentences are represented using contextual embeddings. Word embedding techniques inspire this but in contrast to word embeddings, the contextual embedding of a word depends upon the surrounding words in the sentence. These contextual embeddings are calculated using BERT and other similar models which utilize self-attention and nonlinear transformations.<br />
<br />
<div align="center"> [[File:Pearsson_corr_contextual_emb.PNG|Pearson Correlation for Contextual Embedding]] </div><br />
<div align="center">'''Fig 2'''</div><br />
<br />
=== Cosine Similarity ===<br />
Pairwise cosine similarity is calculated between each token <math> x_{i} </math> in the reference sentence and each token <math> \hat{x}_{j} </math> in the candidate sentence. Because the embedding vectors are pre-normalized to unit length, the cosine similarity reduces to the inner product <math> x_{i}^\top \hat{x}_{j}. </math><br />
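<br />
As a rough illustration of this step, the sketch below builds the pairwise similarity matrix from toy vectors. The two-dimensional vectors are made up for the example (real contextual embeddings come from a model such as BERT and have hundreds of dimensions), and the helper names are hypothetical.<br />
<br />
```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def similarity_matrix(reference_vectors, candidate_vectors):
    """Entry [i][j] is the cosine similarity of reference token i and candidate token j."""
    ref = [normalize(v) for v in reference_vectors]
    cand = [normalize(v) for v in candidate_vectors]
    # After normalization, cosine similarity is just the inner product.
    return [[sum(a * b for a, b in zip(x_i, x_hat_j)) for x_hat_j in cand] for x_i in ref]


# Toy stand-ins for token embeddings (assumed values, not model outputs).
reference_vectors = [[1.0, 0.0], [0.0, 2.0]]
candidate_vectors = [[3.0, 0.0], [1.0, 1.0], [0.0, -1.0]]
sim = similarity_matrix(reference_vectors, candidate_vectors)
for row in sim:
    print(row)
```
<br />
The matrix has one row per reference token and one column per candidate token, which is the layout the matching step operates on.<br />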
<br />
=== BERTScore ===<br />
<br />
Each token in <math> x </math> is matched to the most similar token in <math> \hat{x} </math> to compute Recall, and each token in <math> \hat{x} </math> is matched to the most similar token in <math> x </math> to compute Precision. The matching is greedy and isolated: every token is matched independently of the others. Precision and Recall are then combined into the F1 score. The equations for Precision, Recall, and F1 score are as follows:<br />
<br />
<div align="center"> [[File:Equations.PNG|Equations for the calculation of BERTScore.]] </div><br />
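<br />
The greedy matching can be sketched as follows, assuming a precomputed similarity matrix whose rows correspond to reference tokens and whose columns correspond to candidate tokens. The similarity values are hypothetical and the function name is chosen for illustration; this is not the authors' implementation.<br />
<br />
```python
def bertscore_from_similarity(sim):
    """Compute (precision, recall, F1) from sim[i][j], where i indexes
    reference tokens and j indexes candidate tokens."""
    # Recall: each reference token takes its best match among candidate tokens.
    recall = sum(max(row) for row in sim) / len(sim)
    # Precision: each candidate token takes its best match among reference tokens.
    columns = list(zip(*sim))
    precision = sum(max(col) for col in columns) / len(columns)
    # F1: harmonic mean of precision and recall (assumes precision + recall != 0).
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical similarities for a 2-token reference and a 3-token candidate.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]
precision, recall, f1 = bertscore_from_similarity(sim)
print(precision, recall, f1)
```
<br />
Because each token independently picks its best match, one token may be matched several times; this is what "greedy and isolated" refers to.<br />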
<br />
<br />
=== Importance Weighting (optional) ===<br />
In some cases, rare words can be highly indicative of sentence similarity. Therefore, Inverse Document Frequency (idf) weights can be incorporated into the BERTScore equations above. This step is optional: depending on the domain of the text and the available data, it may or may not improve the final results. Understanding when importance weighting helps remains an open area of research.<br />
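<br />
A minimal sketch of idf-weighted recall is given below. It assumes that idf is estimated from the tokenized reference sentences and that the weighted recall is an idf-weighted average of each reference token's best match; the smoothing formula, the toy corpus, and the similarity values are assumptions made for this example.<br />
<br />
```python
import math
from collections import Counter


def idf_weights(reference_corpus):
    """Map each token to an idf weight computed over tokenized reference sentences.

    Uses log((M + 1) / (df + 1)), one common form of plus-one smoothing
    (an assumption in this sketch).
    """
    num_sentences = len(reference_corpus)
    document_frequency = Counter()
    for sentence in reference_corpus:
        document_frequency.update(set(sentence))
    return {tok: math.log((num_sentences + 1) / (df + 1))
            for tok, df in document_frequency.items()}


def idf_weighted_recall(reference_tokens, sim, idf):
    """idf-weighted average of each reference token's best match.

    Assumes at least one reference token has a non-zero weight.
    """
    weights = [idf[token] for token in reference_tokens]
    best = [max(row) for row in sim]
    return sum(w * s for w, s in zip(weights, best)) / sum(weights)


corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "bird", "sang"]]
idf = idf_weights(corpus)

reference_tokens = ["the", "cat"]
sim = [[0.5, 0.1],   # "the" vs. two candidate tokens (hypothetical values)
       [0.2, 0.9]]   # "cat" vs. two candidate tokens
weighted = idf_weighted_recall(reference_tokens, sim, idf)
unweighted = sum(max(row) for row in sim) / len(sim)
print(weighted, unweighted)
```
<br />
In this toy corpus "the" occurs in every sentence and receives zero weight, so the weighted recall is driven entirely by the rarer word "cat".<br />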
<br />
=== Baseline Rescaling ===<br />
Rescaling is done only to make the score more readable for humans. In theory, cosine similarity values lie between -1 and 1, but in practice they are confined to a much smaller range. A baseline value <math> b </math>, computed using Common Crawl monolingual datasets, is used to linearly rescale the BERTScore. The rescaled recall <math> \hat{R}_{BERT} </math> is given by<br />
<div align="center"> [[File:Equation2.PNG|Equation for the rescaled BERTScore.]] </div><br />
Similarly, <math> P_{BERT} </math> and <math> F_{BERT} </math> are rescaled as well.<br />
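<br />
The linear rescaling can be sketched in a few lines, assuming the form (score - b) / (1 - b), which maps the baseline to 0 and a perfect score to 1. The baseline value used here is hypothetical; in practice it depends on the embedding model and language.<br />
<br />
```python
def rescale(score, baseline):
    """Linearly rescale a score so that `baseline` maps to 0 and 1 maps to 1."""
    return (score - baseline) / (1.0 - baseline)


b = 0.85          # hypothetical baseline value
raw_recall = 0.92  # hypothetical raw recall
print(rescale(raw_recall, b))
```
<br />
Rescaling is monotonic, so it changes neither the ranking of candidates nor the correlation with human judgments; it only spreads the scores over a more readable range.<br />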
<br />
== Experiment & Results ==<br />
The authors experimented with different pre-trained contextual embedding models such as BERT and RoBERTa, and reported the results of the best-performing model. In addition to the standard evaluation, they also designed model selection experiments: they use 10K hybrid systems super-sampled from WMT18, randomly select 100 of them, and rank these using the automatic metrics. The evaluation was carried out on Machine Translation and Image Captioning tasks. <br />
<br />
=== Machine Translation ===<br />
The metric evaluation dataset consists of 149 translation systems, gold references, and two types of human judgments: segment-level and system-level. The former assigns a score to each reference-candidate pair, and the latter assigns a single score to the whole system. Segment-level outputs for BERTScore are calculated as explained in the previous section on architecture, and system-level outputs are calculated by averaging the BERTScore over every reference-candidate pair. Absolute Pearson correlation <math> \lvert \rho \rvert </math> and Kendall rank correlation <math> \tau </math> are used to measure metric quality, with the Williams test <sup> [1] </sup> for the significance of <math> \lvert \rho \rvert </math> and bootstrap resampling, as suggested by Graham & Baldwin <sup> [2] </sup>, for <math> \tau </math>. The authors also created hybrid systems by randomly sampling, for each reference sentence, one candidate sentence from one of the systems. This increases the number of systems available for system-level experiments. Further, the authors randomly selected 100 systems out of the 10K hybrid systems and ranked them using the automatic metrics. They repeated this process multiple times and reported Hits@1, the percentage of trials in which the metric's top-ranked system agrees with the human ranking of the best system. <br />
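<br />
The Hits@1 procedure can be sketched as a small simulation. The code below assumes one system-level metric score and one human score per system; the function name, the subset size, and the synthetic scores are illustrative choices, not the authors' setup.<br />
<br />
```python
import random


def hits_at_1(metric_scores, human_scores, subset_size, trials, seed=0):
    """Fraction of random subsets in which the metric and the human scores
    pick the same best system."""
    rng = random.Random(seed)
    systems = list(range(len(metric_scores)))
    hits = 0
    for _ in range(trials):
        subset = rng.sample(systems, subset_size)
        best_by_metric = max(subset, key=lambda s: metric_scores[s])
        best_by_human = max(subset, key=lambda s: human_scores[s])
        hits += best_by_metric == best_by_human
    return hits / trials


# Synthetic human scores for 50 systems (made-up values).
human_scores = [float(i) for i in range(50)]
perfect_metric = list(human_scores)            # agrees with humans everywhere
inverted_metric = [-h for h in human_scores]   # ranks systems in reverse order

print(hits_at_1(perfect_metric, human_scores, subset_size=10, trials=200))
print(hits_at_1(inverted_metric, human_scores, subset_size=10, trials=200))
```
<br />
A metric that ranks systems exactly as humans do reaches a Hits@1 of 1.0, while a metric that reverses the ranking never selects the human-preferred system.<br />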
<br />
<div align="center"> '''The following 4 tables show the result of the experiments mentioned above.''' </div> <br><br />
<br />
<div align="center"> [[File:Table1_BERTScore.PNG|700px| Table1 Machine Translation]] [[File:Table2_BERTScore.PNG|700px| Table2 Machine Translation]] </div><br />
<div align="center"> [[File:Table3_BERTScore.PNG|700px| Table3 Machine Translation]] [[File:Table4_BERTScore.PNG|700px| Table4 Machine Translation]] </div><br />
<br />
In all four tables, BERTScore is consistently among the top performers. It also gives a large improvement over the widely used BLEU metric. In to-English translation, RUSE shows competitive results, but it is a learned metric and requires costly human judgments as supervision.<br />
<br />
=== Image Captioning ===<br />
For Image Captioning, human judgments for 12 submission entries from the COCO 2015 Captioning Challenge are used. Following Cui et al. (2018) <sup> [3] </sup>, the Pearson correlation with two system-level metrics is calculated. The metrics are the percentage of captions that are better than or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). Each image has approximately five reference captions, and the BERTScore of a candidate is taken to be the maximum of its BERTScores computed against each reference caption individually. BERTScore is compared with 8 task-agnostic metrics and 2 task-specific metrics. <br />
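<br />
The two ingredients of this evaluation, taking the maximum score over several references and correlating system-level scores with human judgments, can be sketched as follows. All numbers are made up for illustration and the function names are hypothetical.<br />
<br />
```python
import math


def best_reference_score(scores_against_references):
    """Score of a candidate caption: the maximum over its per-reference scores."""
    return max(scores_against_references)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# One candidate caption scored against five reference captions (hypothetical).
caption_score = best_reference_score([0.61, 0.74, 0.58, 0.80, 0.66])

# Hypothetical system-level metric scores and human judgments for four systems.
metric_scores = [0.1, 0.2, 0.3, 0.4]
human_judgments = [10.0, 20.0, 30.0, 40.0]
correlation = pearson(metric_scores, human_judgments)
print(caption_score, correlation)
```
<br />
Taking the maximum means a candidate caption is rewarded for matching any one of the acceptable references rather than all of them.<br />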
<br />
<div align="center"> [[File:Table5_BERTScore.PNG|450px| Table5 Image Captioning]] </div><br />
<br />
<div align="center"> '''Table 5: Pearson correlation on the 2015 COCO Captioning Challenge.''' </div><br />
<br />
BERTScore is again a top performer, while n-gram metrics like BLEU show a weak correlation with human judgments. For this task, importance weighting yields a significant improvement, indicating the importance of content words. <br />
<br />
'''Speed:''' BERTScore is slower than BLEU, but the absolute cost remains small. For example, on the same hardware, the Machine Translation test takes 15.6 seconds with BERTScore compared to 5.4 seconds with BLEU. Since both run times are on the order of seconds, the difference is marginal in practice.<br />
<br />
== Robustness Analysis ==<br />
The authors tested BERTScore's robustness using two adversarial paraphrase classification datasets, QQP and PAWS. The table below summarizes the results. Most metrics perform well on QQP, but their performance drops significantly on PAWS. In contrast, BERTScore performs competitively on PAWS, which suggests that it is better at distinguishing harder adversarial examples.<br />
<br />
<div align="center"> [[File: bertscore.png | 500px]] </div><br />
<br />
== Source Code == <br />
The code for this paper is available at [https://github.com/Tiiiger/bert_score BERTScore].<br />
<br />
== Critique & Future Prospects==<br />
BERTScore is a text evaluation metric that outperforms previous approaches because it uses contextual embeddings for evaluation. It is simple and easy to use. BERTScore is also more robust than previous approaches, as shown by the experiments carried out on datasets of paraphrased sentences. There are variants of BERTScore depending on the contextual embedding model, the use of importance weighting, and the evaluation metric (Precision, Recall, or F1 score). <br />
<br />
The main reason behind the success of BERTScore is the use of contextual embeddings. The remaining architecture is straightforward in itself. Some word-embedding-based methods use more complex metrics for calculating similarity. Combining those metrics with contextual embeddings instead of word embeddings might yield more reliable performance than BERTScore.<br />
<br />
<br />
The paper is interesting, but the proposed approach arguably lacks technical novelty. The method is a natural application of BERT combined with traditional cosine similarity measures, precision-, recall-, and F1-based computations, and simple IDF-based importance weighting.<br />
<br />
== References ==<br />
<br />
[1] Evan James Williams. Regression analysis. Wiley, 1959.<br />
<br />
[2] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In EMNLP, 2014.<br />
<br />
[3] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. Learning to evaluate image captioning. In CVPR, 2018.<br />
<br />
[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.<br />
<br />
[5] Qingsong Ma, Ondrej Bojar, and Yvette Graham. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In WMT, 2018.<br />
<br />
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.<br />
<br />
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019b.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE&diff=49379SuperGLUE2020-12-06T11:30:38Z<p>A227jain: Grammatical improvements</p>
<hr />
<div><br />
== Presented by ==<br />
Shikhar Sakhuja<br />
<br />
== Introduction == <br />
Natural Language Processing (NLP) has seen immense improvements over the past two years. The improvements offered by RNN-based models such as ELMo [2] and Transformer-based [1] models such as OpenAI GPT [3] and BERT [4] have revolutionized the field. These models render GLUE [5], the standard benchmark for NLP tasks, ineffective. The GLUE benchmark was released over a year ago and assesses NLP models using a single-number metric that summarizes performance over a diverse set of tasks. However, Transformer-based models now outperform non-expert humans on several tasks. With these models achieving near-perfect scores on almost all GLUE tasks and outperforming humans on some, there is a need for a new benchmark that involves harder and even more diverse language tasks. The authors release SuperGLUE as a new benchmark with a more rigorous set of language understanding tasks.<br />
<br />
== Related Work == <br />
There have been several benchmarks attempting to standardize the field of language understanding tasks. SentEval [6] evaluated fixed-size sentence embeddings for tasks. DecaNLP [7] converts tasks into a general question-answering format. GLUE offers a much more flexible and extensible benchmark since it imposes no restrictions on model architectures or parameter sharing. <br />
<br />
GLUE has been the gold standard for language understanding tests since its release. In fact, the benchmark has promoted growth in language models, with all of the Transformer-based models initially attempting to achieve high scores on GLUE. The original GPT and BERT models scored 72.8 and 80.2 on GLUE, respectively. However, the latest GPT and BERT models far outperform these scores, creating a need for a more robust and difficult benchmark.<br />
<br />
Limits to current approaches are also apparent via the GLUE suite. Performance on the GLUE diagnostic entailment dataset falls far below the average human performance of 0.80 <math>R_3</math> reported in the original GLUE publication, with models performing near, or even below, chance on some linguistic phenomena.<br />
<br />
== Motivation ==<br />
Transformer-based models allow NLP systems to be trained using transfer learning, which was previously seen mainly in Computer Vision tasks and was notoriously difficult for language because of the discrete nature of words. Transfer learning in NLP allows models to be trained over terabytes of language data in a self-supervised fashion. These models can then be fine-tuned for downstream tasks such as sentiment classification and fake news detection. The fine-tuned models beat many human labelers who are not experts in the domain. This creates a need for a newer, more robust benchmark that can stay relevant given the rapid improvements in the field of NLP. <br />
<br />
[[File:loser glue.png]]<br />
<br />
Figure 1: Transformer-based models outperforming humans in GLUE tasks.<br />
<br />
== Improvements to GLUE ==<br />
SuperGLUE follows the design principles of GLUE but seeks to improve on its predecessor in many ways:<br />
<br />
'''More challenging tasks:''' SuperGLUE retains the two hardest tasks in GLUE and adds open tasks that are difficult for current NLP approaches.<br />
<br />
'''More diverse task formats:''' SuperGLUE expands GLUE task formats to include coreference resolution and question answering.<br />
<br />
'''Comprehensive human baselines:''' Human performance estimates are provided for all benchmark tasks.<br />
<br />
'''Improved code support:''' SuperGLUE is built around widely used tools, including PyTorch and AllenNLP.<br />
<br />
'''Refined usage rules:''' SuperGLUE leaderboard ensures fair competition and full credit to creators.<br />
<br />
== Design Process ==<br />
<br />
SuperGLUE is designed to be widely applicable to many different NLP tasks. That being said, in designing SuperGLUE, certain criteria needed to be established to determine whether an NLP task can be included. The authors specified six such requirements, which are listed below.<br />
<br />
#'''Task substance:''' Tasks should test a system's reasoning and understanding of English text.<br />
#'''Task difficulty:''' Tasks should be solvable by those who graduated from an English postsecondary institution.<br />
#'''Evaluability:''' Tasks are required to have an automated performance metric that aligns with human judgments of the output quality.<br />
#'''Public data:''' Tasks need to have existing public data for training with a preference for an additional private test set.<br />
#'''Task format:''' Preference for tasks with simpler input and output formats to steer users of the benchmark away from task-specific architectures.<br />
#'''License:''' Task data must be under a license that allows the redistribution and use for research.<br />
<br />
To select tasks included in the benchmarks, the authors put a public request for NLP tasks and received many. From this, they filtered the tasks according to the criteria above and eliminated any tasks that could not be used due to licensing issues or other problems.<br />
<br />
== SuperGLUE Tasks ==<br />
<br />
SuperGLUE has 8 language understanding tasks. They test a model’s understanding of texts in English. The tasks are built to be solvable by most college-educated English speakers but beyond the capabilities of most state-of-the-art systems today. <br />
<br />
'''BoolQ''' (Boolean Questions [9]): A QA task in which each example consists of a short passage and a yes/no question about the passage. <br />
<br />
'''CB''' (CommitmentBank [10]): A corpus of short texts in which at least one sentence contains an embedded clause; each clause is annotated with the degree to which the writer appears committed to its truth. <br />
<br />
'''COPA''' (Choice of Plausible Alternatives [11]): A causal reasoning task in which, given a sentence, the system must choose the cause or effect of the sentence from two potential choices. <br />
<br />
'''MultiRC''' (Multi-Sentence Reading Comprehension [12]): A QA task in which, given a passage and potential answers, the model should label the answers as true or false. The passages are taken from seven domains, including news, fiction, and historical text.<br />
<br />
'''ReCoRD''' (Reading Comprehension with Commonsense Reasoning Dataset [13]): A multiple-choice question answering task where, given a passage with a masked entity, the model should predict the masked-out entity from the available choices. The articles are extracted from CNN and Daily Mail.<br />
<br />
'''RTE''' (Recognizing Textual Entailment [14]): Classifying whether a text can be plausibly inferred from a given passage. <br />
<br />
'''WiC''' (Word in Context [15]): Identifying whether a polysemous word used in multiple sentences is being used with the same sense across sentences or not. <br />
<br />
'''WSC''' (Winograd Schema Challenge [16]): A coreference resolution task in which each sentence contains a pronoun and a list of noun phrases from the sentence. The goal is to identify which noun phrase the pronoun refers to.<br />
<br />
The table below summarizes the tasks included in SuperGLUE, along with the task type and the size of each dataset. In the table, WSD stands for word sense disambiguation, NLI is natural language inference, coref. is coreference resolution, and QA is question answering.<br />
<br />
[[File: supergluetasks.png]]<br />
<br />
In the following chart[18], you can see the differences between the different benchmarks.<br />
[[File: superglue.JPG]]<br />
<br />
<br />
===Scoring===<br />
As with GLUE, the authors seek to give a sense of aggregate system performance over all tasks by averaging the scores of all tasks. Lacking a fair criterion for weighing the contribution of each task to the overall score, they opt for the simple approach of weighing each task equally; for tasks with multiple metrics, those metrics are first averaged to obtain a task score.<br />
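<br />
The aggregation described above can be sketched as a two-level average. The task names below come from the benchmark, but the metric values are hypothetical and the function name is chosen for illustration.<br />
<br />
```python
def superglue_score(task_metrics):
    """Overall score: average the metrics within each task, then average the
    resulting task scores with equal weight per task.

    `task_metrics` maps a task name to the list of its metric values.
    """
    task_scores = [sum(metrics) / len(metrics) for metrics in task_metrics.values()]
    return sum(task_scores) / len(task_scores)


# Hypothetical results: CB is given two metrics here, the other tasks one each.
results = {
    "BoolQ": [80.0],
    "CB": [90.0, 70.0],
    "COPA": [70.0],
}
overall = superglue_score(results)
print(overall)
```
<br />
Averaging within a task first prevents a task that reports two metrics from counting twice in the overall score.<br />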
<br />
== Model Analysis ==<br />
SuperGLUE includes two diagnostic tasks for analyzing linguistic knowledge and gender bias in models. To analyze linguistic and world knowledge, submissions to SuperGLUE are required to include predictions of the sentence pair relation (entailment, not_entailment) on the diagnostic set, made with the model used for the RTE task. For gender bias, SuperGLUE includes the diagnostic dataset Winogender, which measures gender bias in coreference resolution systems. A poor bias score indicates gender bias; however, a good score does not necessarily mean that a model is unbiased. This is one limitation of the dataset.<br />
<br />
== Results ==<br />
<br />
Table 1 offers a summary of the results from SuperGLUE across different models. CBOW baselines are generally close to chance performance. BERT, on the other hand, increases the SuperGLUE score by 25 points and shows the largest improvement on most tasks, especially MultiRC, ReCoRD, and RTE. WSC is trickier for BERT, potentially owing to the small dataset size.<br />
<br />
BERT++ [8] increases BERT’s performance even further. However, in keeping with the goal of the benchmark, the best model still lags well behind human performance. The human results for WiC, MultiRC, RTE, and ReCoRD were already available from [15], [12], [17], and [13], respectively. For the remaining tasks, the authors employed crowd workers to reannotate a sample of each test set according to the methods used in [17]. The large gaps should be relatively difficult for models to close. The biggest margin is for WSC, at 35 points, while CB, RTE, BoolQ, and WiC all have margins of roughly 10 points.<br />
<br />
<br />
[[File: 800px-SuperGLUE result.png]]<br />
<br />
Table 1: Baseline performance on SuperGLUE tasks.<br />
<br />
== Source Code ==<br />
<br />
The source code is available at https://github.com/nyu-mll/jiant .<br />
<br />
== Conclusion ==<br />
SuperGLUE fills the gap left by GLUE, which could no longer keep up with the state of the art (SOTA) in NLP. The new language tasks that the benchmark offers are built to be more robust and more difficult for NLP models to solve. With the gap between the best model and human performance at around 10-35 points on individual tasks, SuperGLUE is likely to remain useful for some time before models catch up to it as well. Overall, this is a significant contribution toward improving general-purpose natural language understanding.<br />
<br />
== Critique == <br />
This is quite a fascinating read in which the authors of the gold-standard benchmark have essentially conceded to the progress in NLP. That Bowman’s team resorted to creating a new benchmark altogether to keep up with the rapid pace of progress in NLP makes me wonder whether these benchmarks are inherently flawed. Applying the idea of Wittgenstein’s Ruler: are we measuring the performance of models using the benchmark, or the quality of benchmarks using the models?<br />
<br />
I’m curious how long SuperGLUE will stay relevant given the advances in NLP. GPT-3, released in June 2020, has outperformed GPT-2 and BERT by a huge margin, given the 100x increase in parameters (175B parameters trained on ~600GB of data for GPT-3, compared to 1.5B parameters trained on 40GB for GPT-2). In October 2020, a new deep learning technique (Pattern Exploiting Training) managed to train a Transformer NLP model with 223M parameters (roughly 0.1% of the parameters of GPT-3) that outperformed GPT-3 by 3 points on SuperGLUE. With the field improving so rapidly, I think SuperGLUE is nothing but a band-aid for benchmarking and will turn obsolete in no time.<br />
<br />
== References ==<br />
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.<br />
<br />
[2] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1202. URL https://www.aclweb.org/anthology/N18-1202<br />
<br />
[3] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018. Unpublished ms. available through a link at https://blog.openai.com/language-unsupervised/.<br />
<br />
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1810.04805.<br />
<br />
[5] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=rJ4km2R5t7.<br />
<br />
[6] Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the 11th Language Resources and Evaluation Conference. European Language Resource Association, 2018. URL https://www.aclweb.org/anthology/L18-1269.<br />
<br />
[7] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information processing Systems (NeurIPS). Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf.<br />
<br />
[8] Jason Phang, Thibault Févry, and Samuel R Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint 1811.01088, 2018. URL https://arxiv.org/abs/1811.01088.<br />
<br />
[9] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936,2019a.<br />
<br />
[10] Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank: Investigating projection in naturally occurring discourse. 2019. To appear in Proceedings of Sinn und Bedeutung 23. Data can be found at https://github.com/mcdm/CommitmentBank/.<br />
<br />
[11] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.<br />
<br />
[12] Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2018. URL https://www.aclweb.org/anthology/papers/N/N18/N18-1023/.<br />
<br />
[13] Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint 1810.12885, 2018.<br />
<br />
[14] Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment. Springer, 2006. URL https://link.springer.com/chapter/10.1007/11736790_9.<br />
<br />
[15] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1808.09121.<br />
<br />
[16] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012. URL http://dl.acm.org/citation.cfm?id=3031843.3031909.<br />
<br />
[17] Nikita Nangia and Samuel R. Bowman. Human vs. Muppet: A conservative estimate of human performance on the GLUE benchmark. In Proceedings of the Association of Computational Linguistics (ACL). Association for Computational Linguistics, 2019. URL https://woollysocks.github.io/assets/GLUE_Human_Baseline.pdf.<br />
<br />
[18] Storks, Shane, Qiaozi Gao, and Joyce Y. Chai. "Recent advances in natural language inference: A survey of benchmarks, resources, and approaches." arXiv preprint arXiv:1904.01172 (2019).</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE&diff=49378SuperGLUE2020-12-06T11:27:14Z<p>A227jain: Added related work info</p>
<hr />
<div><br />
== Presented by ==<br />
Shikhar Sakhuja<br />
<br />
== Introduction == <br />
Natural Language Processing (NLP) has seen immense improvements over the past two years. The improvements offered by RNN-based models such as ELMo [2] and Transformer-based [1] models such as OpenAI GPT [3] and BERT [4] have revolutionized the field. These models render GLUE [5], the standard benchmark for NLP tasks, ineffective. The GLUE benchmark was released over a year ago and assesses NLP models using a single-number metric that summarizes performance over a diverse set of tasks. However, transformer-based models now outperform non-expert humans on several of these tasks. With transformer-based models achieving near-perfect scores on almost all tasks in GLUE and outperforming humans on some, there is a need for a new benchmark that involves harder and even more diverse language tasks. The authors release SuperGLUE as a new benchmark with a more rigorous set of language understanding tasks.<br />
<br />
== Related Work == <br />
There have been several benchmarks attempting to standardize the field of language understanding tasks. SentEval [6] evaluated fixed-size sentence embeddings for tasks. DecaNLP [7] converts tasks into a general question-answering format. GLUE offers a much more flexible and extensible benchmark since it imposes no restrictions on model architectures or parameter sharing. <br />
<br />
GLUE has been the gold standard for language understanding tests since its release. In fact, the benchmark has promoted growth in language models, with transformer-based models starting out by attempting to achieve high scores on GLUE. The original GPT and BERT models scored 72.8 and 80.2 on GLUE, respectively. However, the latest GPT and BERT models far outperform these scores, creating a need for a more robust and difficult benchmark.<br />
<br />
Limits to current approaches are also apparent via the GLUE suite. Performance on the GLUE diagnostic entailment dataset falls far below the average human performance of 0.80 <math>R_3</math> reported in the original GLUE publication, with models performing near, or even below, chance on some linguistic phenomena.<br />
<br />
== Motivation ==<br />
Transformer-based NLP models can be trained using transfer learning, which was previously only seen in Computer Vision tasks and was notoriously difficult for language because of the discrete nature of words. Transfer learning in NLP allows models to be pre-trained on terabytes of language data in a self-supervised fashion. These models can then be fine-tuned for downstream tasks such as sentiment classification and fake news detection. The fine-tuned models beat many human labelers who are not experts in the domain. This creates a need for a newer, more robust benchmark that can stay relevant despite the rapid improvements in the field of NLP.<br />
<br />
[[File:loser glue.png]]<br />
<br />
Figure 1: Transformer-based models outperforming humans in GLUE tasks.<br />
<br />
== Improvements to GLUE ==<br />
SuperGLUE follows the design principles of GLUE but seeks to improve on its predecessor in many ways:<br />
<br />
'''More challenging tasks:''' SuperGLUE retains the two hardest tasks in GLUE and adds tasks, gathered through an open call, that are difficult for current NLP approaches.<br />
<br />
'''More diverse task formats:''' SuperGLUE expands GLUE task formats to include coreference resolution and question answering.<br />
<br />
'''Comprehensive human baselines:''' Human performance estimates are provided for all benchmark tasks.<br />
<br />
'''Improved code support:''' SuperGLUE is built around widely used tools, including PyTorch and AllenNLP.<br />
<br />
'''Refined usage rules:''' SuperGLUE leaderboard ensures fair competition and full credit to creators.<br />
<br />
== Design Process ==<br />
<br />
SuperGLUE is designed to be widely applicable to many different NLP tasks. That being said, in designing SuperGLUE, certain criteria needed to be established to determine whether an NLP task could be included. The authors specified six such requirements, which are listed below.<br />
<br />
#'''Task substance:''' Tasks should test a system's reasoning and understanding of English text.<br />
#'''Task difficulty:''' Tasks should be solvable by those who graduated from an English postsecondary institution.<br />
#'''Evaluability:''' Tasks are required to have an automated performance metric that aligns with human judgements of the output quality.<br />
#'''Public data:''' Tasks need to have existing public data for training with a preference for an additional private test set.<br />
#'''Task format:''' Preference for tasks with simpler input and output formats to steer users of the benchmark away from task-specific architectures.<br />
#'''License:''' Task data must be under a license that allows the redistribution and use for research.<br />
<br />
To select the tasks included in the benchmark, the authors put out a public call for NLP tasks and received many proposals. They then filtered the proposals according to the criteria above and eliminated any tasks that could not be used due to licensing issues or other problems.<br />
<br />
== SuperGLUE Tasks ==<br />
<br />
SuperGLUE has 8 language understanding tasks, which test a model’s understanding of texts in English. The tasks are designed to be solvable by most college-educated English speakers while remaining beyond the capabilities of most state-of-the-art systems today.<br />
<br />
'''BoolQ''' (Boolean Questions [9]): A QA task in which each example consists of a short passage and a related question that must be answered with yes or no.<br />
<br />
'''CB''' (CommitmentBank [10]): A corpus of short texts in which at least one sentence contains an embedded clause; each clause is annotated with the degree to which the writer appears committed to its truth.<br />
<br />
'''COPA''' (Choice of Plausible Alternatives [11]): A reasoning task in which, given a premise sentence, the system must choose its cause or effect from two candidate alternatives.<br />
<br />
'''MultiRC''' (Multi-Sentence Reading Comprehension [12]): A QA task in which, given a passage, a question, and a list of candidate answers, the model must label each answer as true or false. The passages are taken from seven domains, including news, fiction, and historical text.<br />
<br />
'''ReCoRD''' (Reading Comprehension with Commonsense Reasoning Dataset [13]): A multiple-choice question answering task in which, given a passage with a masked entity, the model must predict the masked-out entity from the available choices. The articles are extracted from CNN and Daily Mail.<br />
<br />
'''RTE''' (Recognizing Textual Entailment [14]): Classifying whether a text can be plausibly inferred from a given passage. <br />
<br />
'''WiC''' (Word in Context [15]): Identifying whether a polysemous word used in multiple sentences is being used with the same sense across sentences or not. <br />
<br />
'''WSC''' (Winograd Schema Challenge [16]): A coreference resolution task in which each sentence contains a pronoun and several candidate noun phrases. The goal is to identify the noun phrase to which the pronoun refers.<br />
<br />
The table below summarizes the tasks included in SuperGLUE, along with the task type and the size of each dataset. In the table, WSD stands for word sense disambiguation, NLI is natural language inference, coref. is coreference resolution, and QA is question answering.<br />
<br />
[[File: supergluetasks.png]]<br />
<br />
The following chart [18] compares the different benchmarks.
[[File: superglue.JPG]]<br />
<br />
<br />
===Scoring===<br />
As with GLUE, the authors seek to give a sense of aggregate system performance over all tasks by averaging the scores of all tasks. Lacking a fair criterion for weighing the contribution of each task to the overall score, they opt for the simple approach of weighing each task equally; for tasks with multiple metrics, those metrics are first averaged to obtain a single task score.<br />
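<br />
The averaging scheme described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the task names and numbers are made up and are not results from the paper, and the function names are hypothetical.<br />

```python
from statistics import mean


def task_score(metrics):
    """Average all metrics reported for one task into a single task score."""
    return mean(metrics.values())


def benchmark_score(results):
    """Unweighted mean of the per-task scores (every task counts equally)."""
    return mean(task_score(metrics) for metrics in results.values())


# Hypothetical numbers, for illustration only.
example_results = {
    "TaskA": {"accuracy": 80.0},
    "TaskB": {"accuracy": 90.0, "f1": 70.0},     # two metrics, averaged first
    "TaskC": {"f1": 50.0, "exact_match": 30.0},  # two metrics, averaged first
}

overall = benchmark_score(example_results)
print(f"overall score: {overall:.1f}")
```

Note that a task with two metrics does not count twice: its metrics are collapsed into one task score before the benchmark-level average is taken.<br />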
<br />
== Model Analysis ==<br />
SuperGLUE includes two diagnostic datasets for analyzing linguistic knowledge and gender bias in models. To analyze linguistic and world knowledge, submissions to SuperGLUE are required to include predictions of the sentence-pair relation (entailment, not_entailment) on the diagnostic set, made with the model used for the RTE task. As for gender bias, SuperGLUE includes the diagnostic dataset Winogender, which measures gender bias in coreference resolution systems. A poor bias score indicates gender bias; however, a good score does not necessarily mean that a model is unbiased. This is one limitation of the dataset.<br />
<br />
== Results ==<br />
<br />
Table 1 offers a summary of the results from SuperGLUE across different models. CBOW baselines are generally close to chance performance. BERT, on the other hand, increases the SuperGLUE score by 25 points and shows the largest improvement on most tasks, especially MultiRC, ReCoRD, and RTE. WSC is trickier for BERT, potentially owing to the small dataset size.<br />
<br />
BERT++ [8] increases BERT’s performance even further. However, in keeping with the goal of the benchmark, the best model still lags well behind human performance. The human results for WiC, MultiRC, RTE, and ReCoRD were already available from [15], [12], [17], and [13], respectively. For the remaining tasks, the authors employed crowd workers to reannotate a sample of each test set according to the methods used in [17]. The large gaps should be relatively difficult for models to close. The biggest margin is for WSC, at 35 points, while CB, RTE, BoolQ, and WiC all have margins of roughly 10 points.<br />
<br />
<br />
[[File: 800px-SuperGLUE result.png]]<br />
<br />
Table 1: Baseline performance on SuperGLUE tasks.<br />
<br />
== Source Code ==<br />
<br />
The source code is available at https://github.com/nyu-mll/jiant .<br />
<br />
== Conclusion ==<br />
SuperGLUE fills the gap left by GLUE, which could no longer keep up with the state of the art (SOTA) in NLP. The new language tasks that the benchmark offers are built to be more robust and more difficult for NLP models to solve. With the gap between the best model and human performance at around 10-35 points on individual tasks, SuperGLUE is likely to remain useful for some time before models catch up to it as well. Overall, this is a significant contribution toward improving general-purpose natural language understanding.<br />
<br />
== Critique == <br />
This is quite a fascinating read in which the authors of the gold-standard benchmark have essentially conceded to the progress in NLP. That Bowman’s team resorted to creating a new benchmark altogether to keep up with the rapid pace of progress in NLP makes me wonder whether these benchmarks are inherently flawed. Applying the idea of Wittgenstein’s Ruler: are we measuring the performance of models using the benchmark, or the quality of benchmarks using the models?<br />
<br />
I’m curious how long SuperGLUE will stay relevant given the advances in NLP. GPT-3, released in June 2020, has outperformed GPT-2 and BERT by a huge margin, given the 100x increase in parameters (175B parameters trained on ~600GB of data for GPT-3, compared to 1.5B parameters trained on 40GB for GPT-2). In October 2020, a new deep learning technique (Pattern Exploiting Training) managed to train a Transformer NLP model with 223M parameters (roughly 0.1% of the parameters of GPT-3) that outperformed GPT-3 by 3 points on SuperGLUE. With the field improving so rapidly, I think SuperGLUE is nothing but a band-aid for benchmarking and will turn obsolete in no time.<br />
<br />
== References ==<br />
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.<br />
<br />
[2] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1202. URL https://www.aclweb.org/anthology/N18-1202<br />
<br />
[3] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018. Unpublished ms. available through a link at https://blog.openai.com/language-unsupervised/.<br />
<br />
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1810.04805.<br />
<br />
[5] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=rJ4km2R5t7.<br />
<br />
[6] Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the 11th Language Resources and Evaluation Conference. European Language Resource Association, 2018. URL https://www.aclweb.org/anthology/L18-1269.<br />
<br />
[7] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information processing Systems (NeurIPS). Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf.<br />
<br />
[8] Jason Phang, Thibault Févry, and Samuel R Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint 1811.01088, 2018. URL https://arxiv.org/abs/1811.01088.<br />
<br />
[9] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936,2019a.<br />
<br />
[10] Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank: Investigating projection in naturally occurring discourse. 2019. To appear in Proceedings of Sinn und Bedeutung 23. Data can be found at https://github.com/mcdm/CommitmentBank/.<br />
<br />
[11] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.<br />
<br />
[12] Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2018. URL https://www.aclweb.org/anthology/papers/N/N18/N18-1023/.<br />
<br />
[13] Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint 1810.12885, 2018.<br />
<br />
[14] Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment. Springer, 2006. URL https://link.springer.com/chapter/10.1007/11736790_9.<br />
<br />
[15] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1808.09121.<br />
<br />
[16] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012. URL http://dl.acm.org/citation.cfm?id=3031843.3031909.<br />
<br />
[17] Nikita Nangia and Samuel R. Bowman. Human vs. Muppet: A conservative estimate of human performance on the GLUE benchmark. In Proceedings of the Association of Computational Linguistics (ACL). Association for Computational Linguistics, 2019. URL https://woollysocks.github.io/assets/GLUE_Human_Baseline.pdf.<br />
<br />
[18] Storks, Shane, Qiaozi Gao, and Joyce Y. Chai. "Recent advances in natural language inference: A survey of benchmarks, resources, and approaches." arXiv preprint arXiv:1904.01172 (2019).</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification&diff=49377a fair comparison of graph neural networks for graph classification2020-12-06T11:23:42Z<p>A227jain: Added for more information for GNNs and improved grammatical mistakes</p>
<hr />
<div>== Presented By ==<br />
Jaskirat Singh Bhatia<br />
<br />
==Background==<br />
<br />
Experimental reproducibility in machine learning has been known to be an issue for some time. Researchers attempting to reproduce the results of older algorithms have come up short, raising concerns that a lack of reproducibility hurts the quality of the field. The lack of open-source AI code has only exacerbated this, leading some to go so far as to say that "AI faces a reproducibility crisis" [1]. It has been argued that the ability to reproduce existing AI code, and making both existing and new code open source, is a key step in lowering the socio-economic barriers to entry into data science and computing. Recently, the graph representation learning field has attracted the attention of a wide research community, which has resulted in a large stream of works. As such, several Graph Neural Network models have been developed to effectively tackle graph classification. However, experimental procedures often lack rigor and are hardly reproducible. The authors tried to reproduce the results of such experiments to tackle the problems of ambiguity in experimental procedures and the impossibility of reproducing results. They also standardized the experimental environment so that results can be reproduced within it.<br />
<br />
==Graph Neural Networks==<br />
A graph is a data structure consisting of nodes and edges. Graph neural networks are models that take graph-structured data as input and capture information of the input graph, such as relation and interaction between nodes. In graph neural networks, nodes aggregate information from their neighbors. The key idea is to generate representations of nodes depending on the graph structure. <br />
<br />
They provide a convenient way to perform node-level, edge-level, and graph-level prediction tasks. The intuition behind GNNs is that nodes are naturally defined by their neighbors and connections. A typical application of GNNs is node classification: every node in the graph is associated with a label, and we want to predict the labels of the nodes for which no ground truth is available.<br />
<br />
Graph neural networks can perform various tasks and have been used in many applications. Some simple and typical tasks include classifying the input graph or finding a missing edge or node in the graph. One example of a real application where GNNs are used is social network prediction and recommendation, where the input data is naturally structured as a graph.<br />
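<br />
To make the idea of neighborhood aggregation concrete, here is a minimal sketch in plain Python. It is not the architecture of any model discussed on this page: real GNN layers also apply learned weight matrices and non-linearities, which are left out here, and the function names and the toy graph are assumptions made for illustration.<br />

```python
def aggregate_neighbors(features, adjacency):
    """One round of mean aggregation: each node averages its own feature
    vector with the feature vectors of its neighbors."""
    updated = {}
    for node, neighbors in adjacency.items():
        group = [node] + list(neighbors)  # the node itself plus its neighbors
        dim = len(features[node])
        updated[node] = [
            sum(features[n][d] for n in group) / len(group) for d in range(dim)
        ]
    return updated


def readout(features):
    """Graph-level representation: the mean of all node vectors."""
    nodes = list(features)
    dim = len(features[nodes[0]])
    return [sum(features[n][d] for n in nodes) / len(nodes) for d in range(dim)]


# Toy undirected graph 0 - 1 - 2 with 2-dimensional node features.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}

hidden = aggregate_neighbors(features, adjacency)  # node-level representations
graph_vector = readout(hidden)                     # input to a graph classifier
```

Node-level tasks would use the vectors in <code>hidden</code> directly, while graph classification would feed <code>graph_vector</code> to a classifier.<br />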
<br />
====Graph basics====<br />
<br />
Graphs come from discrete mathematics and, as previously mentioned, are composed of two building blocks: vertices (nodes), <math>v_i \in V</math>, and edges, <math>e_j \in E</math>. The edges of a graph can have a direction associated with them, giving a '''directed graph''', or have no sense of direction, giving an '''undirected graph'''. Vertices and edges can also carry weights or, more generally, any number of features.<br />
<br />
Going one level of abstraction higher, graphs can be categorized by structural patterns; we will refer to these as types of graphs, and the following is not an exhaustive list. A '''Bipartite graph''' (a) is one whose vertices can be divided into two sets <math>V_1</math> and <math>V_2</math> such that no two vertices <math>v_i, v_j \in V_k</math>, <math>k=1,2</math>, share an edge; every edge joins some <math>v_i \in V_1</math> to some <math>v_j \in V_2</math>. A '''Path graph''' (b) is a graph with <math>|V| \geq 2</math> in which all vertices are connected sequentially, meaning that every vertex except the first and last has 2 edges, one coming from the previous vertex and one going to the next vertex. A '''Cycle graph''' (c) is similar to a path graph except that every vertex has 2 edges and the vertices are connected in a loop: starting at any vertex and following the edges in one direction eventually leads back to the starting vertex. These are just three examples of graph types; in reality there are many more, and it can be beneficial to connect the structure of one's data to an appropriate graph type.<br />
<br />
<gallery mode="packed"><br />
Image:bipartite.png| (a) Bipartite Graph<br />
Image:path.gif| (b) Path Graph<br />
Image:cycle.png| (c) Cycle Graph<br />
</gallery><br />
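<br />
The three graph types above can be checked programmatically. The following is a small sketch, assuming a simple undirected graph (no self-loops or repeated edges) stored as an adjacency dictionary; the function names are illustrative and do not come from any particular library.<br />

```python
from collections import deque


def is_connected(adj):
    """True if every vertex can be reached from an arbitrary start vertex."""
    if not adj:
        return False
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(adj)


def is_bipartite(adj):
    """Try to 2-color the graph with BFS; an edge between two vertices of
    the same color means the graph is not bipartite."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False
    return True


def is_path_graph(adj):
    """Connected, two end vertices of degree 1, all others of degree 2."""
    n = len(adj)
    degrees = sorted(len(neighbors) for neighbors in adj.values())
    return n >= 2 and is_connected(adj) and degrees == [1, 1] + [2] * (n - 2)


def is_cycle_graph(adj):
    """Connected, and every vertex has exactly 2 edges."""
    n = len(adj)
    return n >= 3 and is_connected(adj) and all(
        len(neighbors) == 2 for neighbors in adj.values()
    )


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
```

In these toy graphs, the path and the square (a cycle of even length) are bipartite, while the triangle is a cycle but is not bipartite.<br />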
<br />
==Problems in Papers==<br />
Some of the most common reproducibility problems encountered in this field concern hyperparameter selection and the correct usage of data splits for model selection versus model assessment. Moreover, the evaluation code is sometimes missing or incomplete, and experiments are not standardized across different works in terms of node and edge features.<br />
<br />
These issues easily generate doubts and confusion among practitioners, who need a fully transparent and reproducible experimental setting. As a matter of fact, the evaluation of a model goes through two different phases, namely model selection on the validation set and model assessment on the test set. Clearly, failing to keep these phases well separated can lead to over-optimistic and biased estimates of the true performance of a model, making it hard for other researchers to present competitive results without following the same ambiguous evaluation procedures.<br />
<br />
==Risk Assessment and Model Selection==<br />
'''Risk Assessment'''
<br />
The goal of risk assessment is to provide an estimate of the performance of a class of models.<br />
When a test set is not explicitly given, a common way to proceed is to use k-fold Cross-Validation.<br />
Because model selection is performed independently for<br />
each training/test split, different “best” hyper-parameter configurations are obtained for different splits; this is why the authors<br />
refer to the performance of a class of models rather than of a single model. <br />
<br />
'''Model Selection'''<br />
<br />
The goal of model selection, or hyperparameter tuning, is to choose among a set of candidate hyperparameter<br />
configurations the one that works best on a specific validation set. It is also important to acknowledge selection bias: because the selected model is the best of a pool of candidate models on the validation set, its validation accuracy is a biased (over-optimistic) estimate of its test accuracy.<br />
<br />
==Overview of Reproducibility Issues==<br />
The paper examines five different GNN models, looking at issues with their experimental setups and potential reproducibility. <br />
===The GNNs were selected based on the following criteria===<br />
<br />
1. Performances obtained with 10-fold CV<br />
<br />
2. Peer reviews<br />
<br />
3. Strong architectural differences<br />
<br />
4. Popularity<br />
<br />
===Criteria to assess the quality of evaluation and reproducibility were as follows===<br />
<br />
1. Code for data pre-processing<br />
<br />
2. Code for model selection<br />
<br />
3. Data splits are provided<br />
<br />
4. Data is split by means of a stratification technique<br />
<br />
5. Results of the 10-fold CV are reported correctly using standard deviations<br />
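Criteria 4 and 5 can be made concrete with a short sketch. It assumes a simple round-robin assignment within each class and a fixed random seed; the helper names <code>stratified_folds</code> and <code>summarize</code> are hypothetical and are not the procedure used by any of the assessed papers.<br />

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds so each fold keeps the class ratios."""
    rng = random.Random(seed)  # fixed seed so the splits can be reproduced
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the samples of this class to the folds in round-robin order.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def summarize(fold_scores):
    """Report the mean and the sample standard deviation over the folds."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)
```

Publishing the resulting fold indices together with the seed addresses criterion 3, and reporting both numbers returned by <code>summarize</code> addresses criterion 5.<br />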
<br />
Based on the selection criteria above, 4 different papers were chosen; their assessment against the quality and reproducibility criteria is as follows:<br />
<br />
[[File:table_3.png|700px|Image: 700 pixels|]]<br />
<br />
Where (Y) indicates that the criterion is met, (N) indicates that the criterion is not satisfied, (A)<br />
indicates ambiguity (i.e. it is unclear whether the criterion is met or not), and (-) indicates lack of information (i.e. no details are provided about the criterion).<br />
<br />
===Issues with DGCNN (Deep Graph Convolutional Neural Network)===<br />
The authors of DGCNN use a faulty method of tuning the learning rate and the number of epochs: they tune these hyperparameters on a single fold, even though the model is evaluated with 10-fold CV, which potentially leads to suboptimal performance. They have not released the code for the experiments. Lastly, they average the results across the 10 folds before reporting them, which also reduces the variance of the reported estimates.<br />
<br />
=== Issues with DiffPool === <br />
The paper does not clearly state whether the results come from a test set or from a validation set. Moreover, the standard deviation over the 10-fold CV is not reported. Because no random seeds are fixed, the data splits differ between runs of the multi-fold procedure, and the splits are made without stratification.<br />
<br />
=== Issue with ECC ===<br />
The results of the paper do not report the standard deviation obtained during the 10-fold Cross-Validation. As in the case of DGCNN, the model selection procedure is not made clear, because the hyper-parameters are pre-determined. The code repository is not available either.<br />
<br />
=== Issues with GIN === <br />
Instead of reporting the test accuracy, the authors report the validation accuracy over the 10-fold CV. Therefore, the given results are not suitable for assessing the model. The code for model selection is not available.<br />
<br />
==Experiments==<br />
The authors re-evaluate the above-mentioned models<br />
on 9 datasets (4 chemical, 5 social), using a model selection and assessment framework that closely<br />
follows the rigorous practices described earlier.<br />
In addition, they implemented two baselines<br />
whose purpose is to understand the extent to which GNNs are able to exploit structural information.<br />
<br />
===Datasets===<br />
<br />
All graph datasets used are publicly available (Kersting et al., 2016) and represent a relevant<br />
subset of those most frequently used in the literature to compare GNNs.<br />
<br />
===Features===<br />
<br />
In GNN literature, it is common practice to augment node descriptors with structural<br />
features. In general, good experimental practice suggests that all models should be compared consistently using<br />
the same input representations. This is why they re-evaluate all models using the same node features.<br />
In particular, they use one common setting for the chemical domain and two alternative settings<br />
for the social domain.<br />
<br />
===Baseline Model===<br />
<br />
They adopted two distinct baselines, one for chemical and one for social datasets. On all<br />
chemical datasets except ENZYMES, they follow Ralaivola et al. (2005) and Luzhnica et al. (2019)<br />
and implement the Molecular Fingerprint technique. On social domains<br />
and ENZYMES (due to the presence of additional features), they take inspiration from the work of<br />
Zaheer et al. (2017) to learn permutation-invariant functions over sets of nodes.<br />
<br />
===Experimental Setting===<br />
<br />
1. Used a 10-fold CV for model assessment<br />
and an inner holdout technique with a 90%/10% training/validation split for model selection.<br />
<br />
2. After each model selection, they train three times on the whole training fold, holding out a random fraction<br />
(10%) of the data to perform early stopping.<br />
<br />
3. The final test fold score is<br />
obtained as the mean of these three runs.<br />
<br />
4. To be consistent with the literature, they implemented early stopping with patience parameter<br />
n, where training stops if n epochs have passed without improvement on the validation set.<br />
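The four steps above can be sketched as a nested loop. This is an illustrative outline only: <code>fit_and_score</code> is a hypothetical callback standing in for training a GNN with early stopping, and using the inner validation set for early stopping during model selection is an assumption of this sketch rather than a detail confirmed by the summary.<br />

```python
import random
import statistics


def holdout_split(indices, fraction, rng):
    """Shuffle a copy of the indices and hold out `fraction` of them."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    n_held = max(1, round(len(shuffled) * fraction))
    return shuffled[n_held:], shuffled[:n_held]


def assess(folds, configs, fit_and_score, seed=0, final_runs=3):
    """Outer k-fold model assessment with an inner holdout for model selection.

    fit_and_score(config, train, early_stop, evaluation) -> float is a
    hypothetical callback: train on `train`, early-stop on `early_stop`,
    and return the score measured on `evaluation`.
    """
    rng = random.Random(seed)
    test_scores = []
    for k, test_fold in enumerate(folds):
        outer_train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        # Model selection: 90%/10% inner holdout; the test fold is never used.
        inner_train, validation = holdout_split(outer_train, 0.1, rng)
        best = max(
            configs,
            key=lambda c: fit_and_score(c, inner_train, validation, validation),
        )
        # Model assessment: retrain the selected configuration several times,
        # each time holding out a random 10% of the data for early stopping.
        runs = []
        for _ in range(final_runs):
            train, early_stop = holdout_split(outer_train, 0.1, rng)
            runs.append(fit_and_score(best, train, early_stop, test_fold))
        test_scores.append(statistics.mean(runs))
    # Mean and standard deviation over the outer folds.
    return statistics.mean(test_scores), statistics.stdev(test_scores)
```

The key design point is that the test fold only appears in the final <code>fit_and_score</code> calls, after the configuration has already been chosen, so the reported score is not affected by selection bias.<br />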
<br />
<br />
[[File:image_1.png|900px|center|Image: 900 pixels]]<br />
<div align="center">'''Figure 2:''' Visualization of the Evaluation Framework </div><br />
To better understand the Model Selection and Model Assessment parts of the above figure, one can also look at the pseudocode below.<br />
[[File:pseudo-code_paper11.png|900px|center|Image: 900 pixels]]<br />
<br />
===Hyper-Parameters===<br />
<br />
1. Hyper-parameter tuning was performed via grid search.<br />
<br />
2. They always included the hyper-parameters used by<br />
other authors in their respective papers.<br />
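A grid search over hyper-parameters amounts to enumerating the Cartesian product of the candidate values. The sketch below illustrates this and the practice of adding previously published values to the grid; the hyper-parameter names and values in the usage note are made up for illustration and are not the grids used in the paper.<br />

```python
from itertools import product


def expand_grid(grid):
    """Yield one configuration dict per combination of candidate values."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def with_published(grid, published):
    """Return a copy of the grid that also contains the published values."""
    merged = {name: list(values) for name, values in grid.items()}
    for name, value in published.items():
        candidates = merged.setdefault(name, [])
        if value not in candidates:
            candidates.append(value)
    return merged
```

For instance, a grid with 2 learning rates and 3 layer counts yields 6 configurations; adding one published learning rate that is not yet in the grid raises this to 9, which shows how quickly grid sizes grow with each extra candidate value.<br />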
<br />
===Computational Considerations===<br />
<br />
As their research included a large number of training-testing cycles, they had to bound the computational cost as follows:<br />
<br />
1. For all models, grid sizes ranged from 32 to 72 possible configurations, depending on the number of<br />
hyper-parameters to choose from.<br />
<br />
2. Limited the time to complete a single training to 72 hours.<br />
<br />
[[File:table_1.png|900px|Image: 900 pixels]]<br />
[[File:table_2.png|900px|Image: 900 pixels]]<br />
<br />
===Effect of Node Degree on Layering===<br />
[[File:Paper11_NodeDegree.png]]<br />
<br />
The above table displays the median number of selected layers in relation to the addition of node<br />
degrees as input features on all social datasets. 1 indicates that an uninformative feature is used as<br />
a node label.<br />
<br />
<br />
===Comparison with Published Results===<br />
[[File:paper11.png|900px|Image: 900 pixels]]<br />
<br />
<br />
In the above figure, we can see the comparison between the average test results obtained by the authors of the paper and those reported in the literature. The plots show that the test accuracies calculated in this paper are in most cases different from what is reported in the literature, and the gap between the two estimates is usually consistent.<br />
== Source Codes ==<br />
The data and scripts to reproduce the experiments reported in the paper are available at https://github.com/diningphil/gnn-comparison .<br />
==Conclusion==<br />
<br />
1. Highlighted ambiguities in the experimental settings of different papers<br />
<br />
2. Proposed a clear and reproducible procedure for future comparisons<br />
<br />
3. Provided a complete re-evaluation of five GNNs<br />
<br />
4. Found out that structure-agnostic baselines outperform GNNs on some chemical datasets, thus suggesting that structural properties have not been exploited yet.<br />
<br />
<br />
==Critique==<br />
This paper raises an important issue about the reproducibility of five important graph neural network models on 9 datasets. Reproducibility and replicability are very important topics for science in general and even more so for fast-growing fields like machine learning. The authors proposed a unified scheme for evaluating reproducibility in graph classification papers. This unified approach can be used in future graph classification papers so that the comparison between proposed methods becomes clearer. The results of the paper are interesting, as in some cases the baseline methods outperform other proposed algorithms. Finally, I believe one of the main limitations of the paper is the lack of technical discussion. For example, it would have been valuable to discuss in more depth why the baseline models perform better, or why the results across different datasets are not consistent. Should we choose the best GNN based on the type of data? If so, what are the guidelines?<br />
<br />
Also, as is well known in the GNN literature, these models are designed to solve non-Euclidean problems on graph-structured data. Such problems can hardly be handled by general deep learning techniques, and there are different types of graph models that handle various mechanisms, e.g. heat diffusion. In my opinion, it would be better to categorize existing GNN models into spatial and spectral domains and reveal the connections among subcategories in each domain. As the number of GNN models grows, further analysis is needed to establish a strong link between the spatial and spectral domains, so that the models become more interpretable and transparent to the application.<br />
<br />
==References==<br />
<br />
- Davide Bacciu, Federico Errica, and Alessio Micheli. Contextual graph Markov model: A deep<br />
and generative approach to graph processing. In Proceedings of the International Conference<br />
on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp.<br />
294–303. PMLR, 2018.<br />
<br />
- Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without<br />
alignments. Journal of molecular biology, 330(4):771–783, 2003.<br />
<br />
- Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In<br />
Advances in Neural Information Processing Systems (NIPS), pp. 1024–1034. Curran Associates,<br />
Inc., 2017.<br />
<br />
- Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark<br />
data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.de.<br />
<br />
- Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5165–5175. Curran Associates, Inc., 2018.<br />
<br />
[1] Hutson, M. (2018). Artificial intelligence faces a reproducibility crisis. Science, 359(6377), 725–726.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Object_Detection_with_Co-Attention_and_Co-Excitation&diff=49376One-Shot Object Detection with Co-Attention and Co-Excitation2020-12-06T11:18:54Z<p>A227jain: Added background information and improved sentences</p>
<hr />
<div>== Presented By ==<br />
Gautam Bathla<br />
<br />
== Background ==<br />
<br />
Object detection is a technique in which the model takes an image as input and outputs the class and location of all the objects present in the image. In the one-shot setting considered here, the aim is to take a query image patch whose class label is not included in the training data and detect all instances of the same class in a target image.<br />
<br />
[[File:object_detection.png|250px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Object Detection on an image</div><br />
<br />
Figure 1 shows an example where the model identifies and locates all the instances of different objects present in the image successfully. It encloses each object within a bounding box and annotates each box with the class of the object present inside the box.<br />
<br />
State-of-the-art object detectors are trained on thousands of images for different classes before the model can accurately predict the class and spatial location for unseen images belonging to the classes the model has been trained on. When a model is trained with K labeled instances for each of N classes, then this setting is known as N-way K-shot classification. K = 0 for zero-shot learning, K = 1 for one-shot learning and K > 1 for few-shot learning.<br />
<br />
== Introduction ==<br />
<br />
This paper tackles the problem of one-shot object detection: given a query image patch ''p'', the model needs to find all the instances in the target image of the object shown in the query. The target and query objects do not need to be exactly the same and are allowed to vary, as long as they share enough attributes to belong to the same category. The authors make three technical contributions. The first is the use of non-local operations to generate better region proposals for the target image based on the query image; this operation can be thought of as a co-attention mechanism. The second is a Squeeze and Co-Excitation mechanism that identifies and emphasizes the relevant features, which helps to pick out relevant proposals and hence the instances in the target image. The third is a margin-based ranking loss for predicting the similarity of region proposals to the given query image, irrespective of whether the class label was seen or unseen during training.<br />
<br />
== Previous Work ==<br />
<br />
All state-of-the-art object detectors are variants of deep convolutional neural networks. There are two types of object detectors:<br />
<br />
1) Two-Stage Object Detectors: These detectors generate region proposals in the first stage and then classify and refine the proposals in the second stage, e.g. Faster R-CNN [1].<br />
<br />
2) One-Stage Object Detectors: These detectors directly predict bounding boxes and their corresponding labels in a single stage, typically from a fixed set of anchors; CornerNet [2], which instead detects objects as paired keypoints, is a one-stage example.<br />
<br />
The work done to tackle the problem of few-shot object detection is based on transfer learning [3], meta-learning [4], and metric-learning.<br />
<br />
1) Transfer Learning: Chen et al. [3] proposed a regularization technique to reduce overfitting when the model is trained on just a few instances for each class belonging to unseen classes.<br />
<br />
2) Meta-Learning: Kang et al. [4] trained a meta-model to re-weight the features of an image extracted from the base model.<br />
<br />
3) Metric-Learning: These frameworks replace the conventional classifier layer with the metric-based classifier layer.<br />
<br />
== Approach ==<br />
<br />
Let <math> C </math> be the set of classes for this object detection task. Since one-shot object detection requires classes at inference time that were unseen during training, we divide the set of classes into two categories as follows:<br />
<br />
<div style="text-align: center;"><math> C = C_0 \bigcup C_1,</math></div><br />
<br />
where <math>C_0</math> represents the classes that the model is trained on and <math>C_1</math> represents the classes on which the inference is done.<br />
<br />
[[File:architecture_object_detection.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Architecture</div><br />
<br />
Figure 2 shows the architecture of the model proposed in this paper. The architecture is based on Faster R-CNN [1], with ResNet-50 [5] as the backbone for extracting features from the images. The target image and the query image are first passed through the ResNet-50 module, and features are extracted from the same convolutional layer for both. These features are then fed into the non-local block, whose output consists of weighted features for each of the images. The new weighted feature sets for both images are passed through the Squeeze and Co-Excitation block, which outputs re-weighted features that serve as input to the Region Proposal Network (RPN) module. The RCNN module also includes a new loss, designed by the authors, that ranks proposals in order of their relevance.<br />
<br />
==== Non-Local Object Proposals ====<br />
<br />
The need for non-local object proposals arises because the RPN module in Faster R-CNN [1] has access to bounding box information for every class in the training dataset: in the standard Faster R-CNN [1] setting, the classes used for training and for inference are not mutually exclusive. In this problem, as defined above, the classes are divided into two parts, one used for training and the other used during inference, so the classes in the two sets are exclusive. If the conventional RPN module were used, it would not generate good proposals for images during inference, because it has never seen bounding boxes for those classes.<br />
<br />
To resolve this problem, a non-local operation is applied to both sets of features. This non-local operation is defined as:<br />
\begin{align}<br />
y_i = \frac{1}{C(z)} \sum_{\forall j}^{} f(x_i, z_j)g(z_j) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
where ''x'' is a vector on which this operation is applied, ''z'' is a vector which is taken as an input reference, ''i'' is the index of output position, ''j'' is the index that enumerates over all possible positions, ''C(z)'' is a normalization factor, <math>f(x_i, z_j)</math> is a pairwise function like Gaussian, Dot product, concatenation, etc., <math>g(z_j)</math> is a linear function of the form <math>W_z \times z_j</math>, and ''y'' is the output of this operation.<br />
<br />
Let the feature maps obtained from the ResNet-50 model be <math> \phi{(I)} \in R^{N \times W_I \times H_I} </math> for target image ''I'' and <math> \phi{(p)} \in R^{N \times W_p \times H_p} </math> for query image ''p''. Taking <math> \phi{(p)} </math> as the input reference, the non-local operation is applied to <math> \phi{(I)} </math> and results in a non-local block, <math> \psi{(I;p)} \in R^{N \times W_I \times H_I} </math> . Analogously, we can derive the non-local block <math> \psi{(p;I)} \in R^{N \times W_p \times H_p} </math> using <math> \phi{(I)} </math> as the input reference. <br />
<br />
We can express the extended feature maps as:<br />
<br />
\begin{align}<br />
{F(I) = \phi{(I)} \oplus \psi{(I;p)} \in R^{N \times W_I \times H_I}} \&nbsp;\&nbsp;;\&nbsp;\&nbsp; {F(p) = \phi{(p)} \oplus \psi{(p;I)} \in R^{N \times W_p \times H_p}} \tag{2} \label{eq:o1}<br />
\end{align}<br />
<br />
where ''F(I)'' denotes the extended feature map for target image ''I'', ''F(p)'' denotes the extended feature map for query image ''p'' and <math>\oplus</math> denotes element-wise sum over the feature maps <math>\phi{}</math> and <math>\psi{}</math>.<br />
<br />
As can be seen above, the extended feature map for the target image ''I'' contains not only features from ''I'' but also a sum of query-image features weighted by their similarity to each target position. The same holds, with the roles swapped, for the query image. This weighted sum acts as a co-attention mechanism, and with the help of the extended feature maps, better proposals are generated when they are fed to the RPN module.<br />
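The non-local operation and the extended feature maps can be illustrated with a small pure-Python sketch. It assumes each feature map is flattened into a list of per-position vectors with ''N'' channels, that ''f'' is a plain dot product, and that ''g'' is the identity (i.e. the learned matrices are replaced by identity matrices), so it shows only the data flow, not the learned model.<br />

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def non_local(x, z):
    """Compute y_i = (1 / C(z)) * sum_j f(x_i, z_j) * g(z_j).

    x and z are lists of per-position feature vectors. Simplifying
    assumptions: f is a dot product and g is the identity.
    """
    norm = len(z)  # C(z): number of positions in the reference feature map
    channels = len(z[0])
    output = []
    for x_i in x:
        y_i = [0.0] * channels
        for z_j in z:
            weight = dot(x_i, z_j)  # similarity between the two positions
            for c in range(channels):
                y_i[c] += weight * z_j[c] / norm
        output.append(y_i)
    return output


def add_maps(a, b):
    """Element-wise sum of two feature maps of identical shape."""
    return [[u + v for u, v in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def extended_features(phi_target, phi_query):
    """Return F(I) = phi(I) + psi(I; p) and F(p) = phi(p) + psi(p; I)."""
    psi_target = non_local(phi_target, phi_query)  # query is the reference
    psi_query = non_local(phi_query, phi_target)   # target is the reference
    return add_maps(phi_target, psi_target), add_maps(phi_query, psi_query)
```

Note that the output of <code>non_local(x, z)</code> always has as many positions as <code>x</code>, which is why the element-wise sum with the original feature map is well defined even when the target and query maps have different spatial sizes.<br />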
<br />
==== Squeeze and Co-Excitation ====<br />
<br />
The two feature maps generated by the non-local block above can be further related by identifying the important channels and re-weighting the channels accordingly; this is the basic purpose of this module. The Squeeze layer summarizes each feature map by applying Global Average Pooling (GAP) to the extended feature map of the query image. The Co-Excitation layer then emphasizes the feature channels that are important for evaluating the similarity metric. The whole block can be represented as:<br />
<br />
\begin{align}<br />
SCE(F(I), F(p)) = w \&nbsp;\&nbsp;;\&nbsp;\&nbsp; F(\tilde{p}) = w \odot F(p) \&nbsp;\&nbsp;;\&nbsp;\&nbsp; F(\tilde{I}) = w \odot F(I)\tag{3} \label{eq:op2}<br />
\end{align}<br />
<br />
where ''w'' is the excitation vector, <math>F(\tilde{p})</math> and <math>F(\tilde{I})</math> are the re-weighted features maps for query and target image respectively.<br />
<br />
In between the Squeeze layer and Co-Excitation layer, there exist two fully-connected layers followed by a sigmoid layer which helps to learn the excitation vector ''w''. The ''Channel Attention'' module in the architecture is basically these fully-connected layers followed by a sigmoid layer.<br />
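A minimal sketch of the Squeeze and Co-Excitation block follows. It assumes feature maps stored as lists of per-position channel vectors, bias-free fully connected layers, and a ReLU between the two layers (borrowed from the usual squeeze-and-excitation design); the weight matrices <code>w1</code> and <code>w2</code> are hypothetical placeholders for learned parameters.<br />

```python
import math


def global_average_pool(feature_map):
    """Squeeze: average every channel over all spatial positions."""
    n_positions = len(feature_map)
    n_channels = len(feature_map[0])
    return [sum(pos[c] for pos in feature_map) / n_positions
            for c in range(n_channels)]


def linear(weights, vector):
    """Bias-free fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def excitation(squeezed, w1, w2):
    """Two fully connected layers followed by a sigmoid.

    The ReLU between the layers is an assumption of this sketch.
    """
    hidden = [max(0.0, h) for h in linear(w1, squeezed)]
    return [1.0 / (1.0 + math.exp(-o)) for o in linear(w2, hidden)]


def reweight(feature_map, w):
    """Scale every channel of every position by its excitation weight."""
    return [[w_c * value for w_c, value in zip(w, pos)] for pos in feature_map]


def squeeze_co_excitation(f_target, f_query, w1, w2):
    """The same vector w, computed from the query, re-weights both maps."""
    w = excitation(global_average_pool(f_query), w1, w2)
    return reweight(f_target, w), reweight(f_query, w), w
```

Because a single excitation vector derived from the query is applied to both feature maps, channels that matter for the query object are emphasized in the target image as well, which is the "co" in co-excitation.<br />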
<br />
==== Margin-based Ranking Loss ====<br />
<br />
The authors define a two-layer MLP network ending with a softmax layer to learn a similarity metric that helps rank the proposals generated by the RPN module. In the first stage of training, each proposal is annotated with 0 or 1 based on its IoU with the ground-truth bounding box: if the IoU is greater than 0.5, the proposal is labeled 1 (foreground); otherwise it is labeled 0 (background).<br />
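The foreground/background annotation can be sketched as follows, assuming axis-aligned boxes given as <code>(x1, y1, x2, y2)</code> and a single ground-truth box per image; both function names are illustrative.<br />

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)


def label_proposals(proposals, ground_truth, threshold=0.5):
    """Label a proposal 1 (foreground) if its IoU exceeds the threshold."""
    return [1 if iou(p, ground_truth) > threshold else 0 for p in proposals]
```

With a ground-truth box of <code>(0, 0, 10, 10)</code>, a proposal <code>(0, 0, 10, 8)</code> has an IoU of 0.8 and is labeled foreground, whereas <code>(5, 0, 15, 10)</code> has an IoU of 1/3 and is labeled background.<br />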
<br />
Let ''q'' be the feature vector obtained after applying GAP to the query image patch obtained from the Squeeze and Co-Excitation block and ''r'' be the feature vector obtained after applying GAP to the region proposals generated by the RPN module. The two vectors are concatenated to form a new vector ''x'' which is the input to the two-layer MLP network designed. We can define ''x = [<math>r^T;q^T</math>]''. Let ''M'' be the model representing the two-layer MLP network, then <math>s_i = M(x_i)</math>, where <math>s_i</math> is the probability of <math>i^{th}</math> proposal being a foreground proposal based on the query image patch ''q''.<br />
<br />
The margin-based ranking loss is given by:<br />
<br />
\begin{align}<br />
L_{MR}(\{x_i\}) = \sum_{i=1}^{K}y_i \times max\{m^+ - s_i, 0\} + (1-y_i) \times max\{s_i - m^-, 0\} + \delta_{i} \tag{4} \label{eq:op3}<br />
\end{align}<br />
\begin{align}<br />
\delta_{i} = \sum_{j=i+1}^{K}[y_i = y_j] \times max\{|s_i - s_j| - m^-, 0\} + [y_i \ne y_j] \times max\{m^+ - |s_i - s_j|, 0\} \tag{5} \label{eq:op4}<br />
\end{align}<br />
<br />
where ''[.]'' is the Iverson bracket, i.e. the output is 1 if the condition inside the bracket is true and 0 otherwise, <math>m^+</math> is the expected lower bound on the probability predicted for a foreground proposal, <math>m^-</math> is the expected upper bound on the probability predicted for a background proposal, and <math>K</math> is the number of candidate proposals from the RPN.<br />
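Equations (4) and (5) translate directly into a double loop over the proposals. The sketch below is a plain-Python transcription, using the margins <math>m^+ = 0.7</math> and <math>m^- = 0.3</math> reported for the paper as defaults; the function name is illustrative.<br />

```python
def margin_ranking_loss(scores, labels, m_pos=0.7, m_neg=0.3):
    """Margin-based ranking loss over K proposals.

    scores[i] is the predicted foreground probability s_i and
    labels[i] is y_i (1 for foreground, 0 for background).
    """
    total = 0.0
    k = len(scores)
    for i in range(k):
        s_i, y_i = scores[i], labels[i]
        # Foreground scores should be at least m_pos, background at most m_neg.
        total += y_i * max(m_pos - s_i, 0.0) + (1 - y_i) * max(s_i - m_neg, 0.0)
        # Pairwise term delta_i over all later proposals j > i.
        for j in range(i + 1, k):
            gap = abs(s_i - scores[j])
            if y_i == labels[j]:
                total += max(gap - m_neg, 0.0)  # same label: scores stay close
            else:
                total += max(m_pos - gap, 0.0)  # different labels: scores far apart
    return total
```

A perfectly separated pair (scores 1.0 and 0.0 for a foreground and a background proposal) incurs zero loss, while two proposals with different labels but the same score of 0.5 are penalized by both the per-proposal terms and the pairwise term.<br />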
<br />
The total loss for the model is given as:<br />
<br />
\begin{align}<br />
L = L_{CE} + L_{Reg} + \lambda \times L_{MR} \tag{6} \label{eq:op5}<br />
\end{align}<br />
<br />
where <math>L_{CE}</math> is the cross-entropy loss, <math>L_{Reg}</math> is the regression loss for bounding boxes of Faster R-CNN [1] and <math>L_{MR}</math> is the margin-based ranking loss defined above.<br />
<br />
For this paper, <math>m^+</math> = 0.7, <math>m^-</math> = 0.3, <math>\lambda</math> = 3, K = 128, C(z) in \eqref{eq:op} is the total number of elements in a single feature map of vector ''z'', and <math>f(x_i, z_j)</math> in \eqref{eq:op} is a dot product operation.<br />
\begin{align}<br />
f(x_i, z_j) = \alpha(x_i)^T \beta(z_j)\&nbsp;\&nbsp;;\&nbsp;\&nbsp;\alpha(x_i) = W_{\alpha} x_i \&nbsp;\&nbsp;;\&nbsp;\&nbsp; \beta(z_j) = W_{\beta} z_j \tag{7} \label{eq:op6}<br />
\end{align}<br />
<br />
== Results ==<br />
<br />
The model is trained and tested on two popular datasets, VOC and COCO. The ResNet-50 backbone was pre-trained on a reduced ImageNet dataset from which all classes related to COCO were removed (leaving 725 classes), thus ensuring that the model has not seen any of the classes belonging to the inference images.<br />
<br />
==== Results on VOC Dataset ====<br />
<br />
[[File: voc_results_object_detection.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 1:''' Results on VOC dataset</div><br />
<br />
For the VOC dataset, the model is trained on the union of the VOC 2007 and VOC 2012 train and validation sets and tested on the VOC 2007 test set. From the VOC results (Table 1), it can be seen that the model whose ResNet-50 backbone is pre-trained on the reduced training set (Ours(725)) achieves better performance on seen and unseen classes than the baseline models. When the ResNet-50 pre-trained on the full training set (Ours(1K)) is used as the backbone, the performance improves significantly further.<br />
<br />
==== Results on MSCOCO Dataset ====<br />
<br />
[[File: mscoco_splits.png|750px|center|Image: 500 pixels]]<br />
[[File: mscoco_results_object_detection.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2:''' Results on COCO dataset</div><br />
<br />
The model is trained on the COCO train2017 set and evaluated on the COCO val2017 set. The classes are divided into four groups; the model is trained with images belonging to three splits, and the evaluation is done on the images belonging to the fourth split. From Table 2, it can be seen that the model achieves better accuracy than the baseline model. The bar chart in the split figure shows the performance of the model on each class separately. The model has difficulty with classes such as book (split2), handbag (split3), and tie (split4) because of variations in their shape and texture.<br />
<br />
==== Overall Performance ====<br />
For VOC, the model that uses the reduced ImageNet model backbone with 725 classes achieves a better performance on both the seen and unseen classes. Remarkable improvements in the performance are seen with the backbone with 1000 classes. For COCO, the model achieves better accuracy than the Siamese Mask-RCNN model for both the seen and unseen classes.<br />
<br />
== Ablation Studies ==<br />
<br />
==== Effect of all the proposed techniques on the final result ====<br />
<br />
[[File: one_shot_detector_results.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 3:''' Effect of all three techniques combined</div><br />
<br />
Figure 3 shows the effect of the three proposed techniques on the evaluation metric. The model performs worst when neither the Co-attention nor the Co-excitation mechanism is used. When either Co-attention or Co-excitation is used, the performance improves significantly, and the model performs best when all three proposed techniques are used.<br />
<br />
<br />
In order to understand the effect of the proposed modules, the authors analyzed each module separately.<br />
<br />
==== Visualizing the effect of Non-local RPN ====<br />
<br />
To demonstrate the effect of Non-local RPN, a heatmap of generated proposals is constructed. Each pixel is assigned the count of how many proposals cover that particular pixel and the counts are then normalized to generate a probability map.<br />
<br />
[[File: one_shot_non_local_rpn.png|250px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 4:''' Visualization of Non-local RPN</div><br />
<br />
From Figure 4, it can be seen that when a non-local RPN is used instead of a conventional RPN, the model is able to give more attention to the relevant region in the target image.<br />
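The proposal heatmap described in this section can be sketched as below. It assumes proposals are integer pixel boxes <code>(x1, y1, x2, y2)</code> with exclusive upper bounds, and that "normalized" means dividing by the total count so the map sums to 1; the summary does not specify the normalization, so that choice is an assumption.<br />

```python
def proposal_heatmap(proposals, width, height):
    """Count how many proposals cover each pixel, then normalize.

    Proposals are (x1, y1, x2, y2) in pixels; x2 and y2 are exclusive.
    """
    counts = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in proposals:
        # Clip each box to the image before counting.
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                counts[y][x] += 1
    total = sum(sum(row) for row in counts)
    if total == 0:
        return [[0.0] * width for _ in range(height)]
    return [[count / total for count in row] for row in counts]
```

Pixels covered by many overlapping proposals receive a high value, so a map concentrated on the queried object indicates that the RPN is attending to the relevant region.<br />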
<br />
==== Analyzing and Visualizing the effect of Co-Excitation ====<br />
<br />
To visualize the effect of excitation vector ''w'', the vector is calculated for all images in the inference set which are then averaged over images belonging to the same class, and a pair-wise Euclidean distance between classes is calculated.<br />
<br />
[[File: one_shot_excitation.png|250px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 5:''' Visualization of Co-Excitation</div><br />
<br />
From Figure 5, it can be observed that the Co-Excitation mechanism is able to assign meaningful weight distribution to each class. The weights for classes related to animals are closer to each other and the ''person'' class is not close to any other class because of the absence of common attributes between ''person'' and any other class in the dataset.<br />
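The analysis behind Figure 5 (averaging the excitation vectors per class and comparing classes by Euclidean distance) can be sketched as follows; the function names and the example vectors in the usage note are illustrative and are not data from the paper.<br />

```python
import math
from collections import defaultdict


def class_means(vectors, labels):
    """Average the excitation vectors of all images that share a class."""
    groups = defaultdict(list)
    for vector, label in zip(vectors, labels):
        groups[label].append(vector)
    return {label: [sum(column) / len(members) for column in zip(*members)]
            for label, members in groups.items()}


def pairwise_distances(means):
    """Euclidean distance between the mean vectors of every pair of classes."""
    names = sorted(means)
    return {(a, b): math.dist(means[a], means[b]) for a in names for b in names}
```

For example, if two "cat" images give vectors <code>[-1, 0]</code> and <code>[1, 0]</code> and one "person" image gives <code>[3, 4]</code>, the class means are <code>[0, 0]</code> and <code>[3, 4]</code> and their distance is 5; small distances between classes suggest that similar channels are emphasized for them.<br />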
<br />
[[File: analyzing_co_excitation_1.png|Analyzing Co-Excitation|500px|left|bottom|Image: 500 pixels]]<br />
<br />
[[File: analyzing_co_excitation_2.png|Analyzing Co-Excitation|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 6:''' Analyzing Co-Excitation</div><br />
<br />
To analyze the effect of Co-Excitation, the authors used two different scenarios. In the first scenario (Figure 6, left), the same target image is used with different query images. Query images <math>p_1</math> and <math>p_2</math> have a similar color to the target image, whereas <math>p_3</math> and <math>p_4</math> show an object of a different color. When the pair-wise Euclidean distances between the excitation vectors in the four cases were calculated, <math>w_2</math> was closer to <math>w_1</math> than to <math>w_4</math>, and <math>w_3</math> was closer to <math>w_4</math> than to <math>w_1</math>. Therefore, it can be concluded that <math>w_1</math> and <math>w_2</math> give more importance to channels representing the texture of the object, whereas <math>w_3</math> and <math>w_4</math> give more importance to channels representing its shape.<br />
<br />
A similar observation holds in scenario 2 (Figure 6, right), where the same query image was used with different target images. <math>w_1</math> and <math>w_2</math> are closer to <math>w_a</math> than to <math>w_b</math>, whereas <math>w_3</math> and <math>w_4</math> are closer to <math>w_b</math> than to <math>w_a</math>. Since images <math>I_1</math> and <math>I_2</math> contain an object of similar color to the query image, we can say that <math>w_1</math> and <math>w_2</math> give more weight to the channels representing the texture of the object, and <math>w_3</math> and <math>w_4</math> give more weight to the channels representing shape.<br />
<br />
== Conclusion ==<br />
<br />
The resulting one-shot object detector outperforms all the baseline models on VOC and COCO datasets. The authors have also provided insights about how the non-local proposals, serving as a co-attention mechanism, can generate relevant region proposals in the target image and put emphasis on the important features shared by both target and query image.<br />
<br />
== Critiques ==<br />
<br />
The techniques proposed by the authors improve the performance of the model significantly: as we saw, when either Co-attention or Co-excitation is used along with the margin-based ranking loss, the model can detect the instances of the query object in the target image. Also, the trained model is generic and does not require any additional training or fine-tuning to detect unseen classes in the target image. The designed loss means that the learning process does not rely only on the class labels of images, since each proposal is annotated as foreground or background and this annotation is used to calculate the loss.<br />
Since the architecture combines several deep neural networks, one critique is the computational cost of the proposed model. The paper could have reported more thoroughly how time-consuming the method is.<br />
<br />
== Source Code==<br />
[https://github.com/timy90022/One-Shot-Object-Detection One-Shot-Object-Detection]<br />
<br />
== References ==<br />
<br />
[1] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99, 2015.<br />
<br />
[2] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, pages 765–781, 2018.<br />
<br />
[3] Hao Chen, Yali Wang, Guoyou Wang, and Yu Qiao. LSTD: A low-shot transfer detector for object detection. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 2836–2843, 2018.<br />
<br />
[4] Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. CoRR, abs/1812.01866, 2018.<br />
<br />
[5] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates&diff=49375Breaking Certified Defenses: Semantic Adversarial Examples With Spoofed Robustness Certificates2020-12-06T10:58:53Z<p>A227jain: removed unwanted symbol</p>
<hr />
<div><br />
== Presented By ==<br />
Gaurav Sikri<br />
<br />
== Background ==<br />
<br />
Adversarial examples are inputs to machine learning models, such as deep neural networks, that an attacker intentionally designs to make the model produce a wrong prediction. They are typically created by adding a small amount of noise to an original image, or otherwise perturbing it, so that the model misclassifies the resulting image. The following figure shows an adversarial attack in which a small perturbation added to an input image changes the model's prediction.<br />
<br />
[[File:adversarial_example.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Adversarial Example </div><br />
<br />
The impact of adversarial attacks can be life-threatening in the real world. Consider a driverless car whose model is trying to read a STOP sign on the road. If the STOP sign is replaced by an adversarial version of the original sign, and that new image fools the model into not stopping, it can lead to an accident. Hence it is important to design classifiers that are immune to such adversarial attacks.<br />
<br />
While training a deep network, the network is trained on a set of augmented images along with the original images. For any given image, multiple augmented images are created and passed to the network so that the model learns from the augmented images as well. During the validation phase, after labeling an image, a certified defense checks whether there exists an image with a different label within a ball of a given radius around the input. Mathematically, such an adversarial example <math>x'</math> satisfies <math>distance(x,x') \leq \delta, f(x)\neq f(x')</math>, where <math>\delta</math> is a small radius and <math>f(\cdot)</math> is the label assigned by the classifier. If the classifier assigns the same class label to all images within the specified ball, then a certificate is issued. The certificate guarantees that no perturbation within that radius can change the prediction for this input; a defense that issues such certificates is called a Certified Defense. The image below shows a certified region (in red).<br />
<br />
[[File:certified_defense.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Certified Defense Illustration </div><br />
<br />
== Introduction ==<br />
Conventional deep learning models are generally highly sensitive to adversarial perturbations (Szegedy et al., 2013): natural-looking, minimally perturbed images can cause them to misclassify. In the last few years, several defenses have been built to protect neural networks against such attacks (Madry et al., 2017; Shafahi et al., 2019), but these defenses are based on heuristics and tricks that are often easily broken (Athalye et al., 2018). This has motivated many researchers to work on certifiably secure networks: classifiers that produce a label for an input image such that the classification remains constant within a bounded set of perturbations around that image. Certified defenses have thus far considered <math>l_\text{p}</math>-bounded attacks: after labelling an input, if there does not exist an image with a different label within the <math>l_\text{p}</math>-norm ball of radius <math>\epsilon</math> centred at the original input, then a certificate is issued. Most of the certified defenses created so far focus on deflecting <math>l_\text{p}</math>-bounded attacks where <math>p = 2</math> or <math>p = \infty</math>.<br />
<br />
In this paper, the authors demonstrate that a system that relies on certificates as a measure of label security can be exploited. The central idea is to show that even a system with a certified defense mechanism is not guaranteed to be secure against adversarial attacks. To this end, the authors present a new class of adversarial examples that target not only the classifier's output label but also the certificate. The statement made by the certificate (i.e., that the input image is not an adversarial example in the chosen norm) is still technically correct; however, in this case the adversary is hiding behind a certificate to avoid detection by a certifiable defense. The attack adds adversarial perturbations that are large in the <math>l_\text{p}</math>-norm (larger than the radius of the certificate region of the original image), producing attack images that lie outside the certificate boundary of the original image and are surrounded by images with the same (wrong) label. The result is a 'spoofed' certificate with a seemingly strong security guarantee despite being adversarially manipulated.<br />
<br />
The following three conditions should be met while creating adversarial examples:<br />
<br />
'''1. Imperceptibility:''' the adversarial image looks like the original example.<br />
<br />
'''2. Misclassification:''' the certified classifier assigns an incorrect label to the adversarial example.<br />
<br />
'''3. Strongly certified:''' the certified classifier provides a strong radius certificate for the adversarial example.<br />
<br />
The main focus of the paper is to attack the certificate of the model. The authors argue that the model can be attacked, no matter how strong the certificate of the model is.<br />
<br />
== Approach ==<br />
The approach used by the authors is the 'Shadow Attack', a generalization of the well-known Projected Gradient Descent (PGD) attack. PGD, a universal first-order adversary [4], is reported to be the only method that greatly improves neural network robustness among all the defenses appearing at ICLR 2018 and CVPR 2018 [5]. The fundamental idea of the PGD attack is the same: adversarial images are created in order to fool the network into making a wrong prediction. The PGD attack solves the following optimization problem, where <math>L</math> is the classification loss and the constraint bounds the size of the change made to the input image. For a recent review of adversarial attacks and more information on PGD attacks, see [1].<br />
<br />
\begin{align}<br />
\max_{\delta} \; L\left ( \theta, x + \delta \right ) \quad \text{s.t.} \quad \left \|\delta \right \|_{p} \leq \epsilon \tag{1} \label{eq:op}<br />
\end{align}<br />
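<br />
The optimization problem above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes <math>p = \infty</math>, a toy logistic-regression classifier in place of a deep network, and hypothetical names and values (<code>pgd_linf</code>, <code>epsilon</code>, <code>step</code>, the weights <code>w</code>). Each iteration takes a signed gradient-ascent step on the loss and then projects <math>\delta</math> back onto the <math>\epsilon</math>-ball.<br />

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_input_grad(w, b, x, y):
    """Cross-entropy loss of a toy linear classifier and its gradient w.r.t. the input x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    tiny = 1e-12  # avoids log(0)
    loss = -(y * math.log(p + tiny) + (1 - y) * math.log(1.0 - p + tiny))
    grad_x = [(p - y) * wi for wi in w]  # dL/dx for logistic regression
    return loss, grad_x


def sign(v):
    return (v > 0) - (v < 0)


def pgd_linf(w, b, x, y, epsilon=0.1, step=0.02, n_steps=20):
    """Maximize the loss over delta subject to ||delta||_inf <= epsilon (assumed p = inf)."""
    delta = [0.0] * len(x)
    for _ in range(n_steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        _, grad = loss_and_input_grad(w, b, x_adv, y)
        # Ascent step in the direction of the gradient sign.
        delta = [di + step * sign(gi) for di, gi in zip(delta, grad)]
        # Projection onto the l_inf ball of radius epsilon.
        delta = [max(-epsilon, min(epsilon, di)) for di in delta]
    return delta


# Hypothetical toy model and input, chosen only for illustration.
w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.5, 0.2, 0.1], 1
delta = pgd_linf(w, b, x, y)
clean_loss, _ = loss_and_input_grad(w, b, x, y)
adv_loss, _ = loss_and_input_grad(w, b, [xi + di for xi, di in zip(x, delta)], y)
print(clean_loss, adv_loss)
```

For other values of <math>p</math>, only the projection step would change; the rest of the loop stays the same.<br />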
<br />
Shadow attack on the other hand targets the certificate of the defenses by creating a new 'spoofed' certificate outside the certificate region of the input image. Shadow attack solves the following optimization problem where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.<br />
<br />
\begin{align}<br />
\max_{\delta} \; L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> is the color regularizer, which keeps changes to the color of the input image small. <math>TV</math> is the Total Variation (smoothness) term, which keeps the newly created image smooth. <math>Dissim</math> is the dissimilarity penalty, which encourages all the color channels (RGB) to be changed by similar amounts.<br />
<br />
The perturbations added to the original images are:<br />
<br />
'''1. small'''<br />
<br />
'''2. smooth'''<br />
<br />
'''3. without dramatic color changes'''<br />
<br />
There are two ways to keep the dissimilarity between color channels at zero or very low, and the authors show that both are effective.<br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} = \delta_{G,i} = \delta_{B,i} \forall i </math>, i.e., for each pixel the perturbations of all channels are equal. The perturbation is a single map <math> \delta_{ W \times H} </math> shared by all channels of an image of size <math>3 \times W \times H</math>. In this case, <math>Dissim(\delta)=0 </math>.<br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not necessarily equal. It uses a full perturbation <math> \delta_{3 \times W \times H} </math> with <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.<br />
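<br />
The penalty terms in equation (2) can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' code: the text above only gives the formula for <math>Dissim</math>, so the forms used here for <math>C</math> (squared norm of the per-channel mean absolute perturbation) and <math>TV</math> (sum of absolute differences between neighbouring pixels) are assumptions, as are all function names and the <math>\lambda</math> values. A perturbation is stored as a nested list <code>delta[channel][row][col]</code>.<br />

```python
def flatten(channel):
    return [v for row in channel for v in row]


def p_norm(values, p=2):
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def dissim(delta, p=2):
    """Dissim from the 3-channel attack: sum of pairwise channel-difference norms."""
    r, g, b = (flatten(ch) for ch in delta)
    diff = lambda u, v: [a - c for a, c in zip(u, v)]
    return p_norm(diff(r, b), p) + p_norm(diff(g, b), p) + p_norm(diff(r, g), p)


def total_variation(delta):
    """Assumed TV form: absolute differences between vertical and horizontal neighbours."""
    total = 0.0
    for ch in delta:
        h, w = len(ch), len(ch[0])
        for i in range(h):
            for j in range(w):
                if i + 1 < h:
                    total += abs(ch[i + 1][j] - ch[i][j])
                if j + 1 < w:
                    total += abs(ch[i][j + 1] - ch[i][j])
    return total


def color_reg(delta):
    """Assumed C form: squared l2 norm of the mean absolute perturbation per channel."""
    means = [sum(abs(v) for v in flatten(ch)) / len(flatten(ch)) for ch in delta]
    return sum(m * m for m in means)


def one_channel(delta_wh):
    """1-channel attack: one W x H map shared by R, G and B, so Dissim is zero."""
    return [[row[:] for row in delta_wh] for _ in range(3)]


def shadow_objective(loss, delta, lam_c=0.5, lam_tv=0.3, lam_s=1.0):
    """Equation (2): classification loss minus the three weighted penalties."""
    return (loss - lam_c * color_reg(delta)
            - lam_tv * total_variation(delta)
            - lam_s * dissim(delta))


gray = one_channel([[0.1, 0.1], [0.1, 0.1]])         # constant, channel-shared
rgb = [[[0.1, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]]]      # only the red channel changes
print(dissim(gray), total_variation(gray), dissim(rgb))
print(shadow_objective(1.0, gray), shadow_objective(1.0, rgb))
```

In the sketch, the channel-shared perturbation incurs no dissimilarity or smoothness penalty, whereas the perturbation that changes only one channel is penalized by all three terms.<br />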
<br />
== Ablation Study of the Attack Parameters ==<br />
In order to determine the required number of SGD steps, the effect of <math> \lambda_s</math>, and the effect of <math> \lambda_{tv}</math> on each of the losses in the cost function, the authors tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figures 4, 5, and 6, we can see that <math>L(\delta)</math> (classification loss), <math>TV(\delta)</math> (Total Variation loss), and <math>C(\delta)</math> (color regularizer) converge to zero within 10 SGD steps. Note that since only the 1-channel attack was used in this part of the experiment, <math>Dissim(\delta)</math> was zero.<br />
Figures 6 and 7 show the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss, respectively.<br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors ran two experiments to demonstrate that their attack on a certified model is able to break its defenses. The datasets used in these experiments were CIFAR-10 and ImageNet.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is a certified defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on randomly augmented batches of images. For the attack, perturbations are applied to the original image such that they satisfy the previously defined conditions, and spoofed certificates are generated for an incorrect class by generating multiple adversarial images.<br />
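<br />
To make the notion of a certified radius concrete, here is a minimal sketch of how a Gaussian randomized-smoothing certificate is commonly computed. It is not taken from the paper or its code and rests on several assumptions: the <math>l_2</math> radius formula <math>R = \sigma \Phi^{-1}(p_{lower})</math>, a hypothetical toy <code>base_classifier</code>, a simple Hoeffding lower confidence bound in place of the tighter bounds used in practice, and a single set of noise samples for both choosing the class and estimating its probability.<br />

```python
import math
import random
from statistics import NormalDist


def base_classifier(x):
    """Hypothetical toy classifier: class 1 if the coordinates sum to a positive value."""
    return 1 if sum(x) > 0 else 0


def certify(x, sigma=0.5, n=2000, alpha=0.001, seed=0):
    """Return (label, radius) for the smoothed classifier at x, or (None, 0.0) to abstain."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        label = base_classifier(noisy)
        counts[label] = counts.get(label, 0) + 1
    top_label, top_count = max(counts.items(), key=lambda kv: kv[1])
    # One-sided Hoeffding bound: holds with probability at least 1 - alpha.
    p_lower = top_count / n - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return None, 0.0  # not confident enough to certify
    p_lower = min(p_lower, 1.0 - 1e-9)  # keep inv_cdf inside its open domain
    radius = sigma * NormalDist().inv_cdf(p_lower)
    return top_label, radius


label, radius = certify([1.0, 1.0])
print(label, radius)
```

Under this view, a spoofed certificate arises when an attack image lies in a region where almost all noisy copies receive the same wrong label: the estimated probability is then close to 1 and the certified radius is large.<br />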
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing:<br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1:''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images and for natural images (a larger radius means a stronger, more confident certificate). </div><br />
<br />
[[File:geese_attack.png|center]]<br />
<div align="center">'''Figure 3:''' Shadow Attack used to build an adversarial example that targets a smoothed ImageNet classifier and results in a large certified radius. Even though the adversarial perturbation is natural and smooth looking, its size measured in <math>l_p</math> metrics is quite large. </div><br />
<br />
The third and fifth columns correspond to the mean radius of the certified region of the original images and the mean radius of the spoofed certificates of the perturbed images, respectively. The mean radius of the certificates of the adversarial images was greater than the mean radius of the original image certificates. This indicates that the 'Shadow Attack' approach was successful in creating spoofed certificates with a larger radius and a wrong label, and therefore in breaking the certified defense.<br />
<br />
=== Attack on CROWN-IBP ===<br />
CROWN-IBP is a certified defense against <math>l_\infty</math>-norm bounded attacks. The same approach was applied to the CROWN-IBP defense, and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2:''' “Robust error” for natural images and “attack error” for Shadow Attack images, using the CIFAR-10 dataset and CROWN-IBP models. Smaller is better. </div><br />
<br />
<br />
The above table shows the robust errors of the CROWN-IBP method on natural images and the attack errors on the Shadow Attack images. The errors in the case of the attack were lower than the corresponding errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\infty</math>-norm certified defense as well.<br />
== Source Code == <br />
<br />
The source code for the paper is available at [https://github.com/AminJun/BreakingCertifiableDefenses BreakingCertifiableDefenses].<br />
== Conclusion ==<br />
From the approach and experiments above, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper is that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques==<br />
<br />
A notable weakness highlighted by this paper is that defenses and certificates are formulated mathematically in terms of an <math> l_{p} </math>-norm constraint, as assumed in equation \eqref{eq:op}. The top models cannot achieve certification beyond a perturbation of <math> \epsilon = 0.3 </math> in the <math> l_{2} </math> norm, while perturbations with <math> \epsilon = 4 </math> added to the target input are barely noticeable to the human eye, and images perturbed with <math> \epsilon = 100 </math> are still easily classified by humans as belonging to the same class. As discussed by many authors, human perception of images goes beyond what the <math> l_{p} </math> norm is capable of capturing. More comprehensive metrics and algorithms have yet to be proposed that capture the correlation between pixels of an image or input data and that better convey to optimization algorithms how humans distinguish the features of an input image. Such a metric would give optimization algorithms a better handle on the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
[1] Xu, H., Ma, Y., Liu, H. C., Deb, D., Liu, H., Tang, J. L., & Jain, A. K. (2020). Adversarial Attacks and Defenses in Images, Graphs, and Text: A Review. International Journal of Automation and Computing, 17(2), 151–178.<br />
<br />
[2] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
[3] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
[4] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.<br />
<br />
[5] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates&diff=49374Breaking Certified Defenses: Semantic Adversarial Examples With Spoofed Robustness Certificates2020-12-06T10:58:00Z<p>A227jain: Added important note and improved grammatical mistakes</p>
<hr />
<div></div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45662THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-22T08:46:15Z<p>A227jain: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Although GNNs have recently proven very efficient in many applications, their theoretical properties are not yet well understood.<br />
<br />
The ability of graph neural networks (GNNs) to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism (Morris et al., 2019; Xu et al., 2019). The WL test works by constructing a labeling of the nodes of the graph in an incremental fashion, and then decides whether two graphs are isomorphic by comparing the labelings of the two graphs. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each graph node by combining it with the aggregation of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. If two nodes receive the same label from the WL test, then every AC-GNN assigns them the same feature vector. Moreover, there are AC-GNNs that can reproduce the WL labeling. This characterization, however, does not settle the issue of which Boolean node classifiers (i.e., functions classifying nodes in graphs as true or false) can be expressed by GNNs: it does not imply that AC-GNNs can capture every node classifier that is refined by the WL test. This work aims to determine which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
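<br />
The incremental labeling used by the WL test can be sketched in a few lines. The following is a minimal illustration of one-dimensional color refinement on a single graph, not code from the paper; the function name <code>wl_refine</code> and the adjacency-dictionary representation are assumptions made for this example.<br />
<br />
```python
def wl_refine(adj, colors, rounds):
    """Color refinement (1-dimensional WL) on an undirected graph.

    adj:    dict mapping each node to a list of its neighbors
    colors: dict mapping each node to its initial label
    rounds: number of refinement iterations to run
    """
    labels = dict(colors)
    for _ in range(rounds):
        # A node's signature is its own label plus the multiset of its
        # neighbors' labels (stored as a sorted tuple).
        signatures = {
            v: (labels[v], tuple(sorted(labels[u] for u in adj[v])))
            for v in adj
        }
        # Compress each distinct signature to a small integer label.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        labels = {v: palette[signatures[v]] for v in adj}
    return labels


# Path graph 0 - 1 - 2 - 3 where every node starts with the same color.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = wl_refine(path, {v: "a" for v in path}, rounds=2)
# The two end nodes share one label and the two inner nodes share another.
```
<br />
Nodes that end up with the same label here are exactly the ones that no AC-GNN can tell apart, which is why the WL labeling gives an upper bound on what AC-GNNs can distinguish.<br />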
<br />
== Introduction ==<br />
The authors tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs, which they call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors. Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other hand, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple, and there are many FOC2 node classifiers (e.g., the classifier that assigns true to every node if and only if the graph has an isolated node) that cannot be expressed by AC-GNNs. This leaves us with the following natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that allows expressing all FOC2 classifiers? In this paper, they provide answers to these two questions.<br />
<br />
<br />
The following are the main contributions:<br />
<br />
1. They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000), or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
<br />
2. Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts, where in each layer they also compute a feature vector for the whole graph and combine it with local aggregations; they call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that each FOC2 formula can be captured by an ACR-GNN.<br />
<br />
They experimentally validate their findings, showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when the networks learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
<br />
== Architecture ==<br />
This paper concentrates on the problem of Boolean node classification: given a (simple, undirected) graph <math>G = (V, E)</math> in which each vertex <math>v \in V</math> has an associated feature vector <math>x_v</math>, the authors aim to classify each graph node as true or false. It is assumed that these feature vectors are one-hot encodings of node colors in the graph, drawn from a finite set of colors. The neighborhood <math>N_G(v)</math> of a node <math>v \in V</math> is the set <math>\{u \mid \{v, u\} \in E\}</math>. The basic architecture for GNNs, and the one studied in recent work on GNN expressibility (Morris et al., 2019; Xu et al., 2019), consists of a sequence of layers that combine the feature vector of every node with the multiset of feature vectors of its neighbors. Formally, let AGG and COM be two sets of aggregation and combination functions. An aggregate-combine GNN (AC-GNN) computes vectors <math>x_v^{(i)}</math> for every node v of the graph G, via the recursive formula<br />
<br />
[[File:a227-formula.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
where each <math>x_v^{(0)}</math> is the initial feature vector <math>x_v</math> of v. Finally, each node v of G is classified according to a Boolean classification function CLS applied to <math>x_v^{(L)}</math>, the feature vector produced by the last layer L.<br />
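<br />
The recursion above can be sketched as follows. This is an illustrative toy in plain Python, not the authors' implementation: the helper names (<code>sum_agg</code>, <code>ac_gnn</code>), the choice of sum as the aggregation, and concatenation as the combination are all assumptions made for this example.<br />
<br />
```python
def sum_agg(vectors, dim):
    """Component-wise sum of a multiset of vectors (zero vector if it is empty)."""
    total = [0] * dim
    for vec in vectors:
        for k, value in enumerate(vec):
            total[k] += value
    return total


def ac_gnn(adj, features, layers, cls):
    """Run an aggregate-combine GNN and classify every node.

    adj:      dict mapping each node to a list of its neighbors
    features: dict mapping each node to its initial feature vector
    layers:   list of (agg, com) function pairs, one pair per layer
    cls:      Boolean classification function applied to the final vectors
    """
    x = dict(features)
    for agg, com in layers:
        # All nodes are updated at once from the previous layer's vectors.
        x = {v: com(x[v], agg([x[u] for u in adj[v]])) for v in adj}
    return {v: cls(x[v]) for v in adj}


# Colors are one-hot encoded as [red, blue].
RED, BLUE = [1, 0], [0, 1]
adj = {0: [1], 1: [0, 2], 2: [1], 3: []}
features = {0: RED, 1: BLUE, 2: RED, 3: RED}

# One layer: sum the neighbors' vectors, then append the result to the node's own vector.
layer = (lambda vecs: sum_agg(vecs, 2), lambda own, nb: list(own) + nb)
# After the layer, position 3 holds the number of blue neighbors.
has_blue_neighbor = ac_gnn(adj, features, [layer], cls=lambda x: x[3] >= 1)
```
<br />
Each node only ever sees its own vector and those of its neighbors, so after L layers its output depends only on nodes at distance at most L.<br />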
<br />
== Concepts ==<br />
=== 1. LOGICAL NODE CLASSIFIER ===<br />
The study relates the power of GNNs to that of classifiers expressed in first-order (FO) predicate logic over (undirected) graphs where each vertex has exactly one color (recall that they call these classifiers logical classifiers).<br />
<br />
=== 2. LOGIC FOC2 ===<br />
The logic FOC2 allows for formulas using all FO constructs and counting quantifiers, but restricted to only two variables. Note that, in terms of their logical expressiveness, FOC2 is strictly less expressive than FO (as counting quantifiers can always be mimicked in FO by using more variables and disequalities), but is strictly more expressive than FO2, the fragment of FO that allows formulas to use only two variables (as β(x) belongs to FOC2 but not to FO2).<br />
<br />
=== 3. FOC2 AND AC-GNN CLASSIFIER ===<br />
While it is true that two nodes are declared indistinguishable by the WL test if and only if they are indistinguishable by all FOC2 classifiers (Proposition 3.2), and if the former holds then such nodes cannot be distinguished by AC-GNNs (Proposition 2.1), this by no means tells us that every FOC2 classifier can be expressed as an AC-GNN. The answer to this problem is covered in the next section.<br />
<br />
=== THE EXPRESSIVE POWER OF AC-GNNS ===<br />
AC-GNNs capture any FOC2 classifier as long as the formulas are further restricted to satisfy a locality property. This happens to be a well-known restriction of FOC2, and corresponds to graded modal logic (de Rijke, 2000), which is fundamental for knowledge representation. The idea of graded modal logic is to force all subformulas to be guarded by the edge predicate E. This means that one cannot express in graded modal logic arbitrary formulas of the form ∃yϕ(y), i.e., whether there is some node that satisfies property ϕ. Instead, one is allowed to check whether some neighbor y of the node x where the formula is being evaluated satisfies ϕ. That is, one is allowed to express the formula ∃y (E(x, y) ∧ ϕ(y)) in the logic, as in this case ϕ(y) is guarded by E(x, y).<br />
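<br />
To make the guarded form concrete, the sketch below evaluates a counting version of the guarded formula, "at least n neighbors y of x satisfy ϕ", at every node of a small graph. The function name and the example graph are hypothetical and chosen only for illustration.<br />
<br />
```python
def at_least_n_neighbors(adj, holds, n):
    """Evaluate 'at least n neighbors y of x satisfy phi' at every node x.

    adj:   dict mapping each node to a list of its neighbors
    holds: dict mapping each node to the truth value of phi at that node
    n:     required number of neighbors satisfying phi
    """
    return {x: sum(1 for y in adj[x] if holds[y]) >= n for x in adj}


# Star graph: center 0 with leaves 1, 2, 3; leaves 1 and 2 are blue.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
blue = {0: False, 1: True, 2: True, 3: False}

# Guarded formulas can be nested, because every step only looks at neighbors.
two_blue_neighbors = at_least_n_neighbors(star, blue, 2)
next_to_such_a_node = at_least_n_neighbors(star, two_blue_neighbors, 1)

# The unguarded formula 'some node is blue' has to inspect the whole graph.
some_node_is_blue = any(blue.values())
```
<br />
Every guarded step reads only the neighbors of x, which matches what one aggregation layer of an AC-GNN has access to; the unguarded check in the last line has no such local counterpart.<br />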
<br />
The relationship between AC-GNNs and graded modal logic goes further: they can show that graded modal logic is the “largest” class of logical classifiers captured by AC-GNNs. This means that the only FO formulas that AC-GNNs are able to learn accurately are those in graded modal logic.<br />
<br />
According to their theorem (Theorem 4.2), a logical classifier is captured by AC-GNNs if and only if it can be expressed in graded modal logic. This holds no matter which aggregate and combine operators are considered, i.e., this is a limitation of the architecture of AC-GNNs, not of the specific functions that one chooses to update the features.<br />
<br />
=== ACR-GNNs ===<br />
The main shortcoming of AC-GNNs for expressing arbitrary FOC2 classifiers is their local behavior. A natural way to break this behavior is to allow for a global feature computation on each layer of the GNN. This is called a global attribute computation in the framework of Battaglia et al. (2018). Following the recent GNN literature (Gilmer et al., 2017; Morris et al., 2019; Xu et al., 2019), they refer to this global operation as a readout. Formally, an aggregate-combine-readout GNN (ACR-GNN) extends AC-GNNs by specifying readout functions <math>\text{READ}^{(i)}</math>, which aggregate the current feature vectors of all the nodes in a graph.<br />
Then, the vector <math>x_v^{(i)}</math> of each node v in G on each layer i is computed by the following formula:<br />
<br />
[[File:a227-formula-final.png|700px|center|Image: 700 pixels]]<br />
<br />
Intuitively, every layer in an ACR-GNN first computes (i.e., “reads out”) the aggregation over all the nodes in G; then, for every node v, it computes the aggregation over the neighbors of v; and finally it combines the features of v with the two aggregation vectors.<br />
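<br />
The three steps of an ACR-GNN layer described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: the names <code>vec_sum</code>, <code>acr_gnn_layer</code> and <code>red_and_some_blue</code> are made up, and the property being checked (a red node in a graph that contains at least one blue node) is picked here only because it needs global information.<br />
<br />
```python
def vec_sum(vectors, dim):
    """Component-wise sum of a multiset of vectors (zero vector if it is empty)."""
    total = [0] * dim
    for vec in vectors:
        for k, value in enumerate(vec):
            total[k] += value
    return total


def acr_gnn_layer(adj, x, agg, read, com):
    """One aggregate-combine-readout layer."""
    # Step 1: read out a single vector for the whole graph.
    global_vec = read([x[v] for v in adj])
    # Steps 2 and 3: aggregate over each node's neighbors, then combine the
    # node's own vector with the local and the global aggregation.
    return {v: com(x[v], agg([x[u] for u in adj[v]]), global_vec) for v in adj}


def red_and_some_blue(adj, features):
    """True at v if v is red and some node of the graph is blue (one-hot [red, blue])."""
    x = acr_gnn_layer(
        adj,
        features,
        agg=lambda vecs: vec_sum(vecs, 2),
        read=lambda vecs: vec_sum(vecs, 2),
        com=lambda own, nb, glob: [1 if own[0] == 1 and glob[1] >= 1 else 0],
    )
    return {v: x[v][0] == 1 for v in adj}


RED, BLUE = [1, 0], [0, 1]
# Nodes 0 and 1 are connected and red; node 2 is blue and isolated.
graph = {0: [1], 1: [0], 2: []}
result = red_and_some_blue(graph, {0: RED, 1: RED, 2: BLUE})
```
<br />
Nodes 0 and 1 are classified as true even though the only blue node is in a different connected component, which no amount of purely local aggregation could detect.<br />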
<br />
The authors note that AC-GNNs cannot capture the FOC2 classifier γ(x). However, using a single readout plus local aggregations one can implement this classifier as follows. First, let B denote the property “having at least 2 blue neighbors”. Then an ACR-GNN that implements γ(x) can (1) use one aggregation to store in the local feature of every node whether the node satisfies B, then (2) use a readout function to count how many nodes satisfying B exist in the whole graph, and (3) use another local aggregation to count how many neighbors of every node satisfy B.<br />
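<br />
The three quantities computed in steps (1) to (3) can be illustrated directly on a small graph. The sketch below only computes these ingredients; how they are combined into γ(x) depends on the definition of γ, which is not reproduced in this summary. The function name <code>gamma_ingredients</code> and the example graph are assumptions made for illustration.<br />
<br />
```python
def gamma_ingredients(adj, blue):
    """Compute the three quantities used in steps (1)-(3).

    adj:  dict mapping each node to a list of its neighbors
    blue: set of blue nodes
    """
    # (1) Local aggregation: does the node have at least 2 blue neighbors (property B)?
    b = {v: sum(1 for u in adj[v] if u in blue) >= 2 for v in adj}
    # (2) Readout: how many nodes in the whole graph satisfy B?
    total_b = sum(1 for v in adj if b[v])
    # (3) Local aggregation: how many neighbors of each node satisfy B?
    neighbors_b = {v: sum(1 for u in adj[v] if b[u]) for v in adj}
    return b, total_b, neighbors_b


# Edges: 0-1, 0-2, 0-4, 1-3, 2-3; nodes 1 and 2 are blue.
adj = {0: [1, 2, 4], 1: [0, 3], 2: [0, 3], 3: [1, 2], 4: [0]}
b, total_b, neighbors_b = gamma_ingredients(adj, blue={1, 2})

# Since a simple graph has no self-loops, total_b - neighbors_b[v] counts the
# nodes satisfying B that are not adjacent to v (v itself included).
not_adjacent_b = {v: total_b - neighbors_b[v] for v in adj}
```
<br />
Step (2) is the only one that needs information from outside a node's neighborhood, which is why a single readout is sufficient for this construction.<br />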
<br />
They next show that just one readout is actually enough. However, this reduction in the number of readouts comes at the cost of severely complicating the resulting GNN. Formally, an aggregate-combine GNN with final readout (AC-FR-GNN) is obtained by using any number of layers as in the AC-GNN definition, together with a final layer that uses a readout function.<br />
<br />
== Experiments ==<br />
The authors performed experiments with synthetic data to empirically validate their results. They performed two sets of experiments: experiments to show that ACR-GNNs can learn a very simple FOC2 node classifier that AC-GNNs cannot learn, and experiments involving complex FOC2 classifiers that need more intermediate readouts to be learned. Besides testing simple AC-GNNs, they also tested the GIN network proposed by Xu et al. (2019) (they took the implementation by Fey & Lenssen (2019) and adapted it to classify nodes). Their experiments use synthetic graphs, with five initial colors encoded as one-hot features, divided into three sets: a train set with 5k graphs of size up to 50-100 nodes, a test set with 500 graphs of size similar to the train set, and another test set with 500 graphs larger than those in the train set. They tried several configurations for the aggregation, combination and readout functions, and report the accuracy of the best configuration. Accuracy in their experiments is computed as the total number of nodes correctly classified among all nodes in all the graphs in the dataset. In every case they ran up to 20 epochs with the Adam optimizer.<br />
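<br />
The accuracy measure described above pools the nodes of all graphs before dividing, so large graphs weigh more than small ones. A minimal sketch, with a hypothetical function name and made-up predictions:<br />
<br />
```python
def node_accuracy(predictions, targets):
    """Fraction of correctly classified nodes over all nodes in all graphs.

    predictions, targets: one list of Boolean node labels per graph
    """
    correct = 0
    total = 0
    for predicted_graph, target_graph in zip(predictions, targets):
        for predicted, target in zip(predicted_graph, target_graph):
            correct += int(predicted == target)
            total += 1
    return correct / total


# A graph with 2 nodes (both correct) and a graph with 8 nodes (4 correct).
predictions = [[True, False], [True] * 8]
targets = [[True, False], [True] * 4 + [False] * 4]
accuracy = node_accuracy(predictions, targets)  # 6 correct nodes out of 10
```
<br />
Averaging the two per-graph accuracies instead (1.0 and 0.5) would give 0.75, so the two conventions can differ noticeably when graph sizes vary.<br />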
<br />
[[File:a227_table1.png|600px|center|Image: 600 pixels]]<br />
<br />
[[File:a227_table2.png|560px|center|Image: 600 pixels]]<br />
<br />
For both types of graphs (line-shaped and random), even single-layer ACR-GNNs showed perfect performance (ACR-1 in Table 1). This was what the authors expected given the simplicity of the property being checked. In contrast, AC-GNNs and GINs (shown in Table 1 as AC-L and GIN-L, representing AC-GNNs and GINs with L layers) struggle to fit the data. In the case of the line-shaped graphs, they were not able to fit the train data even with 7 layers. In the case of random graphs, the performance with 7 layers was considerably better.<br />
<br />
<br />
== Final Remarks ==<br />
<br />
The paper, published as a conference paper at ICLR 2020, shows the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice: Deng et al. (2018) use global-context aware local descriptors to classify objects in 3D point clouds, You et al. (2019) construct node features by computing shortest-path distances to a set of distant anchor nodes, and Haonan et al. (2019) introduce the idea of a “star node” that stores global information of the graph. As mentioned before, this work is close in spirit to that of Xu et al. (2019) and Morris et al. (2019), which establishes the correspondence between the WL test and GNNs.<br />
<br />
Regarding the results on the links between AC-GNNs and graded modal logic (Theorem 4.2), the very recent work of Sato et al. (2019) establishes close relationships between GNNs and certain classes of distributed local algorithms. These in turn have been shown to have strong correspondences with modal logics (Hella et al., 2015).<br />
<br />
== Conclusion ==<br />
The authors successfully establish their claims with the help of ACR-GNNs. The results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice, e.g., Deng et al. (2018) use global-context aware local descriptors to classify objects in 3D point clouds.<br />
As future work, the authors would like to study how their results can be applied to extract logical formulas from GNNs as possible explanations for their computations.<br />
<br />
== Critiques==<br />
<br />
The paper has been quite successful in addressing the question of which binary node classifiers GNNs can express. It was released in 2019 and has already been cited 22 times. The content is well organized and the explanations are easy for an average reader to follow. The authors also discuss future work and possibilities. They could have given more detail about the performance differences across the different classifiers.<br />
<br />
<br />
== References ==<br />
[1] Franz Baader and Carsten Lutz. Description logic. In Handbook of modal logic, pp. 757–819. North-Holland, 2007.<br />
<br />
[2] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. Patel-Schneider (eds.). The description logic handbook: theory, implementation, and applications. Cambridge University Press, 2003.<br />
<br />
[3] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018. URL http://arxiv.org/abs/1806.01261.<br />
<br />
[4] Jin-Yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.<br />
<br />
[5] Ting Chen, Song Bian, and Yizhou Sun. Are powerful graph neural nets necessary? A dissection on graph classification. CoRR, abs/1905.04579, 2019. URL https://arxiv.org/abs/1905.04579.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45661THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-22T08:45:36Z<p>A227jain: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Although GNNs have recently proven very efficient in many applications, their theoretical properties are not yet well understood.<br />
<br />
The ability of graph neural networks (GNNs) to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism (Morris et al., 2019; Xu et al., 2019). The WL test works by constructing a labeling of the nodes of the graph in an incremental fashion, and then decides whether two graphs are isomorphic by comparing the labelings of the two graphs. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each graph node by combining it with the aggregation of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. If two nodes receive the same label from the WL test, then every AC-GNN assigns them the same feature vector. Moreover, there are AC-GNNs that can reproduce the WL labeling. This characterization, however, does not settle the issue of which Boolean node classifiers (i.e., functions classifying nodes in graphs as true or false) can be expressed by GNNs: it does not imply that AC-GNNs can capture every node classifier that is refined by the WL test. This work aims to determine which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
<br />
== Introduction ==<br />
The authors tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs, which they call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors. Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other hand, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple, and there are many FOC2 node classifiers (e.g., the classifier that assigns true to every node if and only if the graph has an isolated node) that cannot be expressed by AC-GNNs. This leaves us with the following natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that allows expressing all FOC2 classifiers? In this paper, they provide answers to these two questions.<br />
<br />
The following are the main contributions:<br />
1. They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000), or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
2. Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts, where in each layer they also compute a feature vector for the whole graph and combine it with local aggregations; they call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that each FOC2 formula can be captured by an ACR-GNN.<br />
<br />
They experimentally validate their findings, showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when the networks learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
<br />
== Architecture ==<br />
This paper concentrates on the problem of Boolean node classification: given a (simple, undirected) graph <math>G = (V, E)</math> in which each vertex <math>v \in V</math> has an associated feature vector <math>x_v</math>, the authors aim to classify each graph node as true or false. It is assumed that these feature vectors are one-hot encodings of node colors in the graph, drawn from a finite set of colors. The neighborhood <math>N_G(v)</math> of a node <math>v \in V</math> is the set <math>\{u \mid \{v, u\} \in E\}</math>. The basic architecture for GNNs, and the one studied in recent work on GNN expressibility (Morris et al., 2019; Xu et al., 2019), consists of a sequence of layers that combine the feature vector of every node with the multiset of feature vectors of its neighbors. Formally, let AGG and COM be two sets of aggregation and combination functions. An aggregate-combine GNN (AC-GNN) computes vectors <math>x_v^{(i)}</math> for every node v of the graph G, via the recursive formula<br />
<br />
[[File:a227-formula.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
where each <math>x_v^{(0)}</math> is the initial feature vector <math>x_v</math> of v. Finally, each node v of G is classified according to a Boolean classification function CLS applied to <math>x_v^{(L)}</math>, the feature vector produced by the last layer L.<br />
<br />
== Concepts ==<br />
=== 1. LOGICAL NODE CLASSIFIER ===<br />
The study relates the power of GNNs to that of classifiers expressed in first-order (FO) predicate logic over (undirected) graphs where each vertex has exactly one color (recall that they call these classifiers logical classifiers).<br />
<br />
=== 2. LOGIC FOC2 ===<br />
The logic FOC2 allows for formulas using all FO constructs and counting quantifiers, but restricted to only two variables. Note that, in terms of their logical expressiveness, FOC2 is strictly less expressive than FO (as counting quantifiers can always be mimicked in FO by using more variables and disequalities), but is strictly more expressive than FO2, the fragment of FO that allows formulas to use only two variables (as β(x) belongs to FOC2 but not to FO2).<br />
<br />
=== 3. FOC2 AND AC-GNN CLASSIFIER ===<br />
While it is true that two nodes are declared indistinguishable by the WL test if and only if they are indistinguishable by all FOC2 classifiers (Proposition 3.2), and if the former holds then such nodes cannot be distinguished by AC-GNNs (Proposition 2.1), this by no means tells us that every FOC2 classifier can be expressed as an AC-GNN. The answer to this problem is covered in the next section.<br />
<br />
=== THE EXPRESSIVE POWER OF AC-GNNS ===<br />
AC-GNNs capture any FOC2 classifier as long as the formulas are further restricted to satisfy a locality property. This happens to be a well-known restriction of FOC2, and corresponds to graded modal logic (de Rijke, 2000), which is fundamental for knowledge representation. The idea of graded modal logic is to force all subformulas to be guarded by the edge predicate E. This means that one cannot express in graded modal logic arbitrary formulas of the form ∃yϕ(y), i.e., whether there is some node that satisfies property ϕ. Instead, one is allowed to check whether some neighbor y of the node x where the formula is being evaluated satisfies ϕ. That is, one is allowed to express the formula ∃y (E(x, y) ∧ ϕ(y)) in the logic, as in this case ϕ(y) is guarded by E(x, y).<br />
<br />
The relationship between AC-GNNs and graded modal logic goes further: they can show that graded modal logic is the “largest” class of logical classifiers captured by AC-GNNs. This means that the only FO formulas that AC-GNNs are able to learn accurately are those in graded modal logic.<br />
<br />
According to their theorem (Theorem 4.2), a logical classifier is captured by AC-GNNs if and only if it can be expressed in graded modal logic. This holds no matter which aggregate and combine operators are considered, i.e., this is a limitation of the architecture of AC-GNNs, not of the specific functions that one chooses to update the features.<br />
<br />
=== ACR-GNNs ===<br />
The main shortcoming of AC-GNNs for expressing arbitrary FOC2 classifiers is their local behavior. A natural way to break this behavior is to allow for a global feature computation on each layer of the GNN. This is called a global attribute computation in the framework of Battaglia et al. (2018). Following the recent GNN literature (Gilmer et al., 2017; Morris et al., 2019; Xu et al., 2019), they refer to this global operation as a readout. Formally, an aggregate-combine-readout GNN (ACR-GNN) extends AC-GNNs by specifying readout functions <math>\text{READ}^{(i)}</math>, which aggregate the current feature vectors of all the nodes in a graph.<br />
Then, the vector <math>x_v^{(i)}</math> of each node v in G on each layer i is computed by the following formula:<br />
<br />
[[File:a227-formula-final.png|700px|center|Image: 700 pixels]]<br />
<br />
Intuitively, every layer in an ACR-GNN first computes (i.e., “reads out”) the aggregation over all the nodes in G; then, for every node v, it computes the aggregation over the neighbors of v; and finally it combines the features of v with the two aggregation vectors.<br />
<br />
The authors note that AC-GNNs cannot capture the FOC2 classifier γ(x). However, using a single readout plus local aggregations one can implement this classifier as follows. First, let B denote the property “having at least 2 blue neighbors”. Then an ACR-GNN that implements γ(x) can (1) use one aggregation to store in the local feature of every node whether the node satisfies B, then (2) use a readout function to count how many nodes satisfying B exist in the whole graph, and (3) use another local aggregation to count how many neighbors of every node satisfy B.<br />
<br />
They next show that just one readout is actually enough. However, this reduction in the number of readouts comes at the cost of severely complicating the resulting GNN. Formally, an aggregate-combine GNN with final readout (AC-FR-GNN) is obtained by using any number of layers as in the AC-GNN definition, together with a final layer that uses a readout function.<br />
<br />
== Experiments ==<br />
The authors performed experiments with synthetic data to empirically validate their results. They performed two sets of experiments: experiments to show that ACR-GNNs can learn a very simple FOC2 node classifier that AC-GNNs cannot learn, and experiments involving complex FOC2 classifiers that need more intermediate readouts to be learned. Besides testing simple AC-GNNs, they also tested the GIN network proposed by Xu et al. (2019) (they took the implementation by Fey & Lenssen (2019) and adapted it to classify nodes). Their experiments use synthetic graphs, with five initial colors encoded as one-hot features, divided into three sets: a train set with 5k graphs of size up to 50-100 nodes, a test set with 500 graphs of size similar to the train set, and another test set with 500 graphs larger than those in the train set. They tried several configurations for the aggregation, combination and readout functions, and report the accuracy of the best configuration. Accuracy in their experiments is computed as the total number of nodes correctly classified among all nodes in all the graphs in the dataset. In every case they ran up to 20 epochs with the Adam optimizer.<br />
<br />
[[File:a227_table1.png|600px|center|Image: 600 pixels]]<br />
<br />
[[File:a227_table2.png|560px|center|Image: 600 pixels]]<br />
<br />
For both types of graphs (line-shaped and random), even single-layer ACR-GNNs showed perfect performance (ACR-1 in Table 1). This was what the authors expected given the simplicity of the property being checked. In contrast, AC-GNNs and GINs (shown in Table 1 as AC-L and GIN-L, representing AC-GNNs and GINs with L layers) struggle to fit the data. In the case of the line-shaped graphs, they were not able to fit the train data even with 7 layers. In the case of random graphs, the performance with 7 layers was considerably better.<br />
<br />
<br />
== Final Remarks ==<br />
<br />
The paper's results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice: Deng et al. (2018) use global-context aware local descriptors to classify objects in 3D point clouds, You et al. (2019) construct node features by computing shortest-path distances to a set of distant anchor nodes, and Haonan et al. (2019) introduce the idea of a “star node” that stores global information of the graph. As mentioned before, this work is close in spirit to that of Xu et al. (2019) and Morris et al. (2019), which establishes the correspondence between the WL test and GNNs.<br />
<br />
Regarding the results on the links between AC-GNNs and graded modal logic (Theorem 4.2), the very recent work of Sato et al. (2019) establishes close relationships between GNNs and certain classes of distributed local algorithms. These in turn have been shown to have strong correspondences with modal logics (Hella et al., 2015).<br />
<br />
== Conclusion ==<br />
The authors successfully establish their claims with the help of ACR-GNNs. The results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice, e.g., Deng et al. (2018) use global-context aware local descriptors to classify objects in 3D point clouds.<br />
As future work, the authors would like to study how their results can be applied to extract logical formulas from GNNs as possible explanations for their computations.<br />
<br />
== Critiques==<br />
<br />
The paper has been quite successful in addressing the question of which binary node classifiers GNNs can express. It was released in 2019 and has already been cited 22 times. The content is well organized and the explanations are easy for an average reader to follow. The authors also discuss future work and possibilities. They could have given more detail about the performance differences across the different classifiers.<br />
<br />
<br />
== References ==<br />
[1] Franz Baader and Carsten Lutz. Description logic. In Handbook of modal logic, pp. 757–819. North-Holland, 2007.<br />
<br />
[2] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. Patel-Schneider (eds.). The description logic handbook: theory, implementation, and applications. Cambridge University Press, 2003.<br />
<br />
[3] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018. URL http://arxiv.org/abs/1806.01261.<br />
<br />
[4] Jin-Yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.<br />
<br />
[5] Ting Chen, Song Bian, and Yizhou Sun. Are powerful graph neural nets necessary? A dissection on graph classification. CoRR, abs/1905.04579, 2019. URL https://arxiv.org/abs/1905.04579.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45660THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-22T08:44:49Z<p>A227jain: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Although GNNs have proven very effective in many applications, their theoretical properties are not yet well understood.<br />
<br />
The ability of graph neural networks (GNNs) to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test works by constructing a labeling of the nodes of the graph, in an incremental fashion, and then decides whether two graphs are isomorphic by comparing the labelings of each graph. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each graph node by combining it with the aggregation of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. Nodes that the WL test cannot distinguish cannot be distinguished by AC-GNNs either; moreover, there are AC-GNNs that can reproduce the WL labeling. This characterization, however, does not settle which Boolean node classifiers (i.e., functions classifying nodes in graphs as true or false) can be expressed by GNNs: it does not imply that AC-GNNs can capture every node classifier that is refined by the WL test. This work aims to determine which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
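<br />
The incremental labeling performed by the WL test can be sketched as follows. This is a minimal illustration of 1-dimensional color refinement, not code from the paper; the function names and the dictionary-based graph encoding are assumptions made for the example.<br />

```python
from collections import Counter


def wl_refine(adj, colors, rounds):
    """Run `rounds` rounds of 1-dimensional WL colour refinement.

    adj:    dict mapping each node to a list of its neighbours.
    colors: dict mapping each node to an initial (hashable) colour.
    Returns a dict mapping each node to its refined label.
    """
    labels = dict(colors)
    for _ in range(rounds):
        new_labels = {}
        for v in adj:
            # New label = own label plus the multiset of neighbour labels.
            counts = Counter(labels[u] for u in adj[v])
            neigh = tuple(sorted(counts.items(), key=repr))
            new_labels[v] = (labels[v], neigh)
        labels = new_labels
    return labels


def wl_test(adj1, colors1, adj2, colors2, rounds):
    """Return False if WL certifies non-isomorphism, True if it cannot tell."""
    l1 = wl_refine(adj1, colors1, rounds)
    l2 = wl_refine(adj2, colors2, rounds)
    # Compare the multisets of labels produced for the two graphs.
    return Counter(l1.values()) == Counter(l2.values())
```

A `True` result only means the test failed to separate the graphs: two disjoint triangles and a 6-cycle, for instance, receive identical labelings although they are not isomorphic.<br />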
<br />
== Introduction ==<br />
They tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs, which they call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors.<br />
<br />
Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other hand, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple, and there are many FOC2 node classifiers (e.g., the trivial one above) that cannot be expressed by AC-GNNs. This leaves us with the following natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that allows expressing all FOC2 classifiers? In this paper, they provide answers to these two questions. <br />
<br />
The following are the main contributions:<br />
<br />
1. They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000), or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
<br />
2. Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts, where in each layer they also compute a feature vector for the whole graph and combine it with local aggregations; they call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that each FOC2 formula can be captured by an ACR-GNN.<br />
<br />
They experimentally validate their findings, showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when they learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
<br />
== Architecture ==<br />
This paper concentrates on the problem of Boolean node classification: given a (simple, undirected) graph G = (V, E) in which each vertex v ∈ V has an associated feature vector <math>x_v</math>, the authors aim to classify each graph node as true or false; in this paper, it is assumed that these feature vectors are one-hot encodings of node colors in the graph, from a finite set of colors. The neighborhood <math>N_G(v)</math> of a node v ∈ V is the set {u | {v, u} ∈ E}. The basic architecture for GNNs, and the one considered in recent studies on GNN expressibility (Morris et al., 2019; Xu et al., 2019), consists of a sequence of layers that combine the feature vectors of every node with the multiset of feature vectors of its neighbors. Formally, let AGG and COM be two sets of aggregation and combination functions. An aggregate-combine GNN (AC-GNN) computes vectors <math>x_v^{(i)}</math> for every node v of the graph G, via the recursive formula<br />
<br />
[[File:a227-formula.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
where each <math>x_v^{(0)}</math> is the initial feature vector <math>x_v</math> of v. Finally, each node v of G is classified according to a Boolean classification function CLS applied to <math>x_v^{(L)}</math>, where L is the number of layers.<br />
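<br />
The aggregate-combine recursion described above can be sketched in plain Python. This is only an illustration under stated assumptions: the names `ac_gnn` and `sum_agg`, the choice of sum as aggregation, concatenation as combination, and the example classifier are all hypothetical and not taken from the paper.<br />

```python
def sum_agg(vectors, dim=2):
    """Sum a multiset of vectors; the empty sum is the zero vector."""
    total = [0] * dim
    for vec in vectors:
        for k, value in enumerate(vec):
            total[k] += value
    return total


def ac_gnn(adj, features, layers, cls):
    """Apply an aggregate-combine GNN.

    adj:      dict node -> list of neighbours (simple undirected graph).
    features: dict node -> initial feature vector (list of numbers).
    layers:   list of (agg, com) pairs, one per layer.
    cls:      Boolean classification function applied to the final vectors.
    """
    x = dict(features)
    for agg, com in layers:
        # Every node is updated simultaneously from the previous layer.
        x = {v: com(x[v], agg([x[u] for u in adj[v]])) for v in adj}
    return {v: cls(x[v]) for v in adj}


# One-hot colour encodings (assumed two colours for the example).
RED, BLUE = [1, 0], [0, 1]


def at_least_two_blue_neighbours(adj, features):
    """Hypothetical one-layer AC-GNN: is the node adjacent to >= 2 blue nodes?"""
    layer = (sum_agg, lambda own, agg: own + agg)  # concatenate own and aggregated
    # Position 3 of the combined vector holds the number of blue neighbours.
    return ac_gnn(adj, features, [layer], lambda h: h[3] >= 2)
```

The combination function here simply concatenates the node's own vector with the aggregated one; any other combination function fits the same loop.<br />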
<br />
== Concepts ==<br />
=== 1. LOGICAL NODE CLASSIFIER ===<br />
Their study relates the power of GNNs to that of classifiers expressed in first-order (FO) predicate logic over (undirected) graphs where each vertex has a unique color (recall that they call these classifiers logical classifiers).<br />
<br />
=== 2. LOGIC FOC2 ===<br />
The logic FOC2 allows for formulas using all FO constructs and counting quantifiers, but restricted to only two variables. Note that, in terms of their logical expressiveness, FOC2 is strictly less expressive than FO (as counting quantifiers can always be mimicked in FO by using more variables and disequalities), but is strictly more expressive than FO2, the fragment of FO that allows formulas to use only two variables (as β(x) belongs to FOC2 but not to FO2).<br />
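<br />
As an illustration of how a counting quantifier can be mimicked in FO with more variables and disequalities, consider the property “x has at least two blue neighbors”. The formulas below are an assumed example written for this summary (the predicate Blue and the names φ, φ′ are illustrative), not formulas quoted from the paper.<br />

```latex
% FOC2: one counting quantifier, only the two variables x and y
\varphi(x) \;:=\; \exists^{\geq 2} y \,\bigl( E(x,y) \wedge \mathrm{Blue}(y) \bigr)

% Plain FO: no counting quantifier, but three variables and a disequality
\varphi'(x) \;:=\; \exists y_1 \exists y_2 \,\bigl( y_1 \neq y_2
    \wedge E(x,y_1) \wedge \mathrm{Blue}(y_1)
    \wedge E(x,y_2) \wedge \mathrm{Blue}(y_2) \bigr)
```

Both formulas define the same node classifier; the FO version needs one extra variable for each additional neighbor to be counted.<br />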
<br />
=== 3. FOC2 AND AC-GNN CLASSIFIER ===<br />
While it is true that two nodes are declared indistinguishable by the WL test if and only if they are indistinguishable by all FOC2 classifiers (Proposition 3.2), and if the former holds then such nodes cannot be distinguished by AC-GNNs (Proposition 2.1), this by no means tells us that every FOC2 classifier can be expressed as an AC-GNN. The answer to this problem is covered in the next section.<br />
<br />
=== THE EXPRESSIVE POWER OF AC-GNNS ===<br />
AC-GNNs can capture any FOC2 classifier as long as the formulas are further restricted to satisfy a locality property. This happens to be a well-known restriction of FOC2, and corresponds to graded modal logic (de Rijke, 2000), which is fundamental for knowledge representation. The idea of graded modal logic is to force all subformulas to be guarded by the edge predicate E. This means that one cannot express in graded modal logic arbitrary formulas of the form ∃yϕ(y), i.e., whether there is some node that satisfies property ϕ. Instead, one is allowed to check whether some neighbor y of the node x where the formula is being evaluated satisfies ϕ. That is, one is allowed to express the formula ∃y (E(x, y) ∧ ϕ(y)) in the logic, as in this case ϕ(y) is guarded by E(x, y).<br />
<br />
The relationship between AC-GNNs and graded modal logic goes further: they can show that graded modal logic is the “largest” class of logical classifiers captured by AC-GNNs. This means that the only FO formulas that AC-GNNs are able to learn accurately are those in graded modal logic.<br />
<br />
According to their theorem, a logical classifier is captured by AC-GNNs if and only if it can be expressed in graded modal logic. This holds no matter which aggregate and combine operators are considered, i.e., this is a limitation of the architecture for AC-GNNs, not of the specific functions that one chooses to update the features.<br />
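<br />
To make the guarded (local) nature of graded modal logic concrete, here is a small evaluator sketch. The tuple-based formula encoding and the function name `evaluate` are hypothetical choices made for this example; the point is only that each quantifier inspects the neighbors of the current node and nothing else.<br />

```python
def evaluate(formula, adj, colors):
    """Return the set of nodes satisfying a graded modal formula.

    Formulas are nested tuples (a hypothetical encoding):
      ("color", c)       the node has colour c
      ("not", f)         negation
      ("and", f, g)      conjunction
      ("atleast", n, f)  at least n neighbours of the node satisfy f
    """
    kind = formula[0]
    if kind == "color":
        return {v for v in adj if colors[v] == formula[1]}
    if kind == "not":
        return set(adj) - evaluate(formula[1], adj, colors)
    if kind == "and":
        return evaluate(formula[1], adj, colors) & evaluate(formula[2], adj, colors)
    if kind == "atleast":
        n, sat = formula[1], evaluate(formula[2], adj, colors)
        # The quantifier is guarded by E: only neighbours of v are inspected.
        return {v for v in adj if sum(1 for u in adj[v] if u in sat) >= n}
    raise ValueError(f"unknown formula kind: {kind!r}")
```

Each nesting level of `"atleast"` corresponds to looking one hop further, which mirrors one additional aggregation layer in an AC-GNN.<br />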
<br />
=== ACR-GNNs ===<br />
The main shortcoming of AC-GNNs for expressing such classifiers is their local behavior. A natural way to break such a behavior is to allow for a global feature computation on each layer of the GNN. This is called a global attribute computation in the framework of Battaglia et al. (2018). Following the recent GNN literature (Gilmer et al., 2017; Morris et al., 2019; Xu et al., 2019), they refer to this global operation as a readout. Formally, an aggregate-combine-readout GNN (ACR-GNN) extends AC-GNNs by specifying readout functions READ(i), which aggregate the current feature vectors of all the nodes in a graph.<br />
Then, the vector <math>x_v^{(i)}</math> of each node v in G on each layer i is computed by the following formula:<br />
<br />
[[File:a227-formula-final.png|700px|center|Image: 700 pixels]]<br />
<br />
Intuitively, every layer in an ACR-GNN first computes (i.e., “reads out”) the aggregation over all the nodes in G; then, for every node v, it computes the aggregation over the neighbors of v; and finally it combines the features of v with the two aggregation vectors.<br />
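<br />
The three stages of an ACR-GNN layer (readout, local aggregation, combination) can be sketched as below. This is a minimal illustration, not the authors' implementation: the helper names, the use of sums for both aggregation and readout, and the example classifier “the node is red and some blue node exists in the graph” are assumptions made for the example.<br />

```python
def vec_sum(vectors, dim):
    """Sum a collection of vectors; the empty sum is the zero vector."""
    total = [0] * dim
    for vec in vectors:
        for k, value in enumerate(vec):
            total[k] += value
    return total


def acr_gnn(adj, features, layers, cls):
    """Apply an aggregate-combine-readout GNN.

    layers is a list of (agg, com, read) triples, one per layer.
    """
    x = dict(features)
    for agg, com, read in layers:
        # Readout: aggregate the current vectors of all nodes in the graph.
        g = read(list(x.values()))
        # Local aggregation over neighbours, then combination with the readout.
        x = {v: com(x[v], agg([x[u] for u in adj[v]]), g) for v in adj}
    return {v: cls(x[v]) for v in adj}


# One-hot colour encodings (assumed two colours for the example).
RED, BLUE = [1, 0], [0, 1]


def red_and_some_blue(adj, features):
    """Hypothetical one-layer ACR-GNN using a global (unguarded) condition."""
    layer = (
        lambda vs: vec_sum(vs, 2),       # AGG: sum over neighbours
        lambda x, a, g: [x[0], g[1]],    # COM: keep own redness and global blue count
        lambda vs: vec_sum(vs, 2),       # READ: sum over all nodes
    )
    return acr_gnn(adj, features, [layer], lambda h: h[0] == 1 and h[1] >= 1)
```

In this example the blue node may be disconnected from the red node being classified, which is exactly the kind of information a purely local aggregation cannot reach.<br />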
<br />
The authors illustrate this with a FOC2 classifier γ(x) that AC-GNNs cannot capture. However, using a single readout plus local aggregations, one can implement this classifier as follows. First, let B denote the property “having at least 2 blue neighbors”. Then an ACR-GNN that implements γ(x) can (1) use one aggregation to store in the local feature of every node whether the node satisfies B, then (2) use a readout function to count how many nodes satisfying B exist in the whole graph, and (3) use another local aggregation to count how many neighbors of every node satisfy B.<br />
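<br />
The three steps above can be traced with a short sketch that computes the three quantities directly. It does not define γ(x) itself, which is not spelled out here; the function name `three_step_counts` and the Boolean `is_blue` encoding are assumptions made for the example.<br />

```python
def three_step_counts(adj, is_blue):
    """Compute the quantities from steps (1)-(3) for every node.

    adj:     dict node -> list of neighbours.
    is_blue: dict node -> bool.
    Returns (b, total_b, neigh_b).
    """
    # (1) Local aggregation: does v have at least 2 blue neighbours (property B)?
    b = {v: sum(1 for u in adj[v] if is_blue[u]) >= 2 for v in adj}
    # (2) Readout: how many nodes in the whole graph satisfy B?
    total_b = sum(1 for v in adj if b[v])
    # (3) Local aggregation: how many neighbours of v satisfy B?
    neigh_b = {v: sum(1 for u in adj[v] if b[u]) for v in adj}
    return b, total_b, neigh_b
```

A final combination function can then compare the global count from step (2) with the neighbor count from step (3) at each node.<br />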
<br />
They next show that just one readout is in fact enough. However, this reduction in the number of readouts comes at the cost of severely complicating the resulting GNN. Formally, an aggregate-combine GNN with final readout (AC-FR-GNN) results from using any number of layers as in the AC-GNN definition, together with a final layer that uses a readout function.<br />
<br />
== Experiments ==<br />
The authors performed experiments with synthetic data to empirically validate their results. They perform two sets of experiments: experiments to show that ACR-GNNs can learn a very simple FOC2 node classifier that AC-GNNs cannot learn, and experiments involving complex FOC2 classifiers that need more intermediate readouts to be learned. Besides testing simple AC-GNNs, they also tested the GIN network proposed by Xu et al. (2019) (they took the implementation by Fey & Lenssen (2019) and adapted it to classify nodes). Their experiments use synthetic graphs, with five initial colors encoded as one-hot features, divided into three sets: a train set with 5k graphs of size up to 50-100 nodes, a test set with 500 graphs of size similar to the train set, and another test set with 500 graphs of size bigger than the train set. They tried several configurations for the aggregation, combination and readout functions, and report the accuracy of the best configuration. Accuracy in their experiments is computed as the total number of nodes correctly classified among all nodes in all the graphs in the dataset. In every case they run up to 20 epochs with the Adam optimizer.<br />
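<br />
The accuracy measure described above pools nodes across all graphs rather than averaging per-graph accuracies. A minimal sketch of that computation follows; the function name and the data layout (one dictionary of Boolean labels per graph) are assumptions for the example.<br />

```python
def node_accuracy(predictions, labels):
    """Fraction of correctly classified nodes over all graphs in a dataset.

    predictions, labels: lists with one dict (node -> bool) per graph.
    """
    correct = 0
    total = 0
    for pred, gold in zip(predictions, labels):
        for node, truth in gold.items():
            correct += int(pred[node] == truth)
            total += 1
    # Guard against an empty dataset.
    return correct / total if total else 0.0
```

With this definition, larger graphs contribute more nodes and therefore weigh more heavily than small ones.<br />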
<br />
[[File:a227_table1.png|600px|center|Image: 600 pixels]]<br />
<br />
[[File:a227_table2.png|560px|center|Image: 560 pixels]]<br />
<br />
For both types of graphs, even single-layer ACR-GNNs showed perfect performance (ACR-1 in Table 1). This was expected given the simplicity of the property being checked. In contrast, AC-GNNs and GINs (shown in Table 1 as AC-L and GIN-L, representing AC-GNNs and GINs with L layers) struggle to fit the data. For the line-shaped graphs, they were not able to fit the training data even with 7 layers. For random graphs, the performance with 7 layers was considerably better.<br />
<br />
<br />
== Final Remarks ==<br />
<br />
The paper's results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice: Deng et al. (2018) use global-context-aware local descriptors to classify objects in 3D point clouds, You et al. (2019) construct node features by computing shortest-path distances to a set of distant anchor nodes, and Haonan et al. (2019) introduce a “star node” that stores global information about the graph. As mentioned before, this work is close in spirit to that of Xu et al. (2019) and Morris et al. (2019), which establish the correspondence between the WL test and GNNs.<br />
<br />
Regarding the results on the links between AC-GNNs and graded modal logic (Theorem 4.2), the very recent work of Sato et al. (2019) establishes close relationships between GNNs and certain classes of distributed local algorithms. These in turn have been shown to have strong correspondences with modal logics (Hella et al., 2015).<br />
<br />
== Conclusion ==<br />
The authors establish their claims with the help of ACR-GNNs. The results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice, e.g., Deng et al. (2018) use global-context-aware local descriptors to classify objects in 3D point clouds.<br />
As future work, the authors would like to study how their results can be applied to extract logical formulas from GNNs as possible explanations for their computations.<br />
<br />
== Critiques==<br />
<br />
The paper succeeds in characterizing which Boolean node classifiers GNNs can express. It was released in 2019 and has already been cited 22 times. The content is well organized, and the explanations are easy for an average reader to follow. The authors also discuss future work and possibilities. However, they could have said more about the performance differences across the different classifiers.<br />
<br />
<br />
== References ==<br />
[1] Franz Baader and Carsten Lutz. Description logic. In Handbook of modal logic, pp. 757–819. North-Holland, 2007.<br />
<br />
[2] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. Patel-Schneider (eds.). The description logic handbook: theory, implementation, and applications. Cambridge University Press, 2003.<br />
<br />
[3] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018. URL http://arxiv.org/abs/1806.01261.<br />
<br />
[4] Jin-Yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.<br />
<br />
[5] Ting Chen, Song Bian, and Yizhou Sun. Are powerful graph neural nets necessary? A dissection on graph classification. CoRR, abs/1905.04579, 2019. URL https://arxiv.org/abs/1905.04579.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45659THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-22T08:44:14Z<p>A227jain: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Although GNNs have proven very effective in many applications, their theoretical properties are not yet well understood.<br />
<br />
The ability of graph neural networks (GNNs) to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test works by constructing a labeling of the nodes of the graph, in an incremental fashion, and then decides whether two graphs are isomorphic by comparing the labelings of each graph. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each graph node by combining it with the aggregation of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. Nodes that the WL test cannot distinguish cannot be distinguished by AC-GNNs either; moreover, there are AC-GNNs that can reproduce the WL labeling. This characterization, however, does not settle which Boolean node classifiers (i.e., functions classifying nodes in graphs as true or false) can be expressed by GNNs: it does not imply that AC-GNNs can capture every node classifier that is refined by the WL test. This work aims to determine which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
<br />
== Introduction ==<br />
They tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs, which they call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors.<br />
<br />
Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other hand, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple, and there are many FOC2 node classifiers (e.g., the trivial one above) that cannot be expressed by AC-GNNs. This leaves us with the following natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that allows expressing all FOC2 classifiers? In this paper, they provide answers to these two questions. <br />
<br />
The following are the main contributions:<br />
<br />
1. They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000), or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
<br />
2. Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts, where in each layer they also compute a feature vector for the whole graph and combine it with local aggregations; they call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that each FOC2 formula can be captured by an ACR-GNN.<br />
<br />
They experimentally validate their findings, showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when they learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
<br />
== Architecture ==<br />
This paper concentrates on the problem of Boolean node classification: given a (simple, undirected) graph G = (V, E) in which each vertex v ∈ V has an associated feature vector <math>x_v</math>, the authors aim to classify each graph node as true or false; in this paper, it is assumed that these feature vectors are one-hot encodings of node colors in the graph, from a finite set of colors. The neighborhood <math>N_G(v)</math> of a node v ∈ V is the set {u | {v, u} ∈ E}. The basic architecture for GNNs, and the one considered in recent studies on GNN expressibility (Morris et al., 2019; Xu et al., 2019), consists of a sequence of layers that combine the feature vectors of every node with the multiset of feature vectors of its neighbors. Formally, let AGG and COM be two sets of aggregation and combination functions. An aggregate-combine GNN (AC-GNN) computes vectors <math>x_v^{(i)}</math> for every node v of the graph G, via the recursive formula<br />
<br />
[[File:a227-formula.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
where each <math>x_v^{(0)}</math> is the initial feature vector <math>x_v</math> of v. Finally, each node v of G is classified according to a Boolean classification function CLS applied to <math>x_v^{(L)}</math>, where L is the number of layers.<br />
<br />
== Concepts ==<br />
=== 1. LOGICAL NODE CLASSIFIER ===<br />
Their study relates the power of GNNs to that of classifiers expressed in first-order (FO) predicate logic over (undirected) graphs where each vertex has a unique color (recall that they call these classifiers logical classifiers).<br />
<br />
=== 2. LOGIC FOC2 ===<br />
The logic FOC2 allows for formulas using all FO constructs and counting quantifiers, but restricted to only two variables. Note that, in terms of their logical expressiveness, FOC2 is strictly less expressive than FO (as counting quantifiers can always be mimicked in FO by using more variables and disequalities), but is strictly more expressive than FO2, the fragment of FO that allows formulas to use only two variables (as β(x) belongs to FOC2 but not to FO2).<br />
<br />
=== 3. FOC2 AND AC-GNN CLASSIFIER ===<br />
While it is true that two nodes are declared indistinguishable by the WL test if and only if they are indistinguishable by all FOC2 classifiers (Proposition 3.2), and if the former holds then such nodes cannot be distinguished by AC-GNNs (Proposition 2.1), this by no means tells us that every FOC2 classifier can be expressed as an AC-GNN. The answer to this problem is covered in the next section.<br />
<br />
=== THE EXPRESSIVE POWER OF AC-GNNS ===<br />
AC-GNNs can capture any FOC2 classifier as long as the formulas are further restricted to satisfy a locality property. This happens to be a well-known restriction of FOC2, and corresponds to graded modal logic (de Rijke, 2000), which is fundamental for knowledge representation. The idea of graded modal logic is to force all subformulas to be guarded by the edge predicate E. This means that one cannot express in graded modal logic arbitrary formulas of the form ∃yϕ(y), i.e., whether there is some node that satisfies property ϕ. Instead, one is allowed to check whether some neighbor y of the node x where the formula is being evaluated satisfies ϕ. That is, one is allowed to express the formula ∃y (E(x, y) ∧ ϕ(y)) in the logic, as in this case ϕ(y) is guarded by E(x, y).<br />
<br />
The relationship between AC-GNNs and graded modal logic goes further: they can show that graded modal logic is the “largest” class of logical classifiers captured by AC-GNNs. This means that the only FO formulas that AC-GNNs are able to learn accurately are those in graded modal logic.<br />
<br />
According to their theorem, a logical classifier is captured by AC-GNNs if and only if it can be expressed in graded modal logic. This holds no matter which aggregate and combine operators are considered, i.e., this is a limitation of the architecture for AC-GNNs, not of the specific functions that one chooses to update the features.<br />
<br />
=== ACR-GNNs ===<br />
The main shortcoming of AC-GNNs for expressing such classifiers is their local behavior. A natural way to break such a behavior is to allow for a global feature computation on each layer of the GNN. This is called a global attribute computation in the framework of Battaglia et al. (2018). Following the recent GNN literature (Gilmer et al., 2017; Morris et al., 2019; Xu et al., 2019), they refer to this global operation as a readout. Formally, an aggregate-combine-readout GNN (ACR-GNN) extends AC-GNNs by specifying readout functions READ(i), which aggregate the current feature vectors of all the nodes in a graph.<br />
Then, the vector <math>x_v^{(i)}</math> of each node v in G on each layer i is computed by the following formula:<br />
<br />
[[File:a227-formula-final.png|700px|center|Image: 700 pixels]]<br />
<br />
Intuitively, every layer in an ACR-GNN first computes (i.e., “reads out”) the aggregation over all the nodes in G; then, for every node v, it computes the aggregation over the neighbors of v; and finally it combines the features of v with the two aggregation vectors.<br />
<br />
The authors illustrate this with a FOC2 classifier γ(x) that AC-GNNs cannot capture. However, using a single readout plus local aggregations, one can implement this classifier as follows. First, let B denote the property “having at least 2 blue neighbors”. Then an ACR-GNN that implements γ(x) can (1) use one aggregation to store in the local feature of every node whether the node satisfies B, then (2) use a readout function to count how many nodes satisfying B exist in the whole graph, and (3) use another local aggregation to count how many neighbors of every node satisfy B.<br />
<br />
They next show that just one readout is in fact enough. However, this reduction in the number of readouts comes at the cost of severely complicating the resulting GNN. Formally, an aggregate-combine GNN with final readout (AC-FR-GNN) results from using any number of layers as in the AC-GNN definition, together with a final layer that uses a readout function.<br />
<br />
== Experiments ==<br />
The authors performed experiments with synthetic data to empirically validate their results. They perform two sets of experiments: experiments to show that ACR-GNNs can learn a very simple FOC2 node classifier that AC-GNNs cannot learn, and experiments involving complex FOC2 classifiers that need more intermediate readouts to be learned. Besides testing simple AC-GNNs, they also tested the GIN network proposed by Xu et al. (2019) (they took the implementation by Fey & Lenssen (2019) and adapted it to classify nodes). Their experiments use synthetic graphs, with five initial colors encoded as one-hot features, divided into three sets: a train set with 5k graphs of size up to 50-100 nodes, a test set with 500 graphs of size similar to the train set, and another test set with 500 graphs of size bigger than the train set. They tried several configurations for the aggregation, combination and readout functions, and report the accuracy of the best configuration. Accuracy in their experiments is computed as the total number of nodes correctly classified among all nodes in all the graphs in the dataset. In every case they run up to 20 epochs with the Adam optimizer.<br />
<br />
[[File:a227_table1.png|600px|center|Image: 600 pixels]]<br />
<br />
[[File:a227_table2.png|560px|center|Image: 560 pixels]]<br />
<br />
For both types of graphs, even single-layer ACR-GNNs showed perfect performance (ACR-1 in Table 1). This was expected given the simplicity of the property being checked. In contrast, AC-GNNs and GINs (shown in Table 1 as AC-L and GIN-L, representing AC-GNNs and GINs with L layers) struggle to fit the data. For the line-shaped graphs, they were not able to fit the training data even with 7 layers. For random graphs, the performance with 7 layers was considerably better.<br />
<br />
<br />
== Final Remarks ==<br />
<br />
The paper's results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice: Deng et al. (2018) use global-context-aware local descriptors to classify objects in 3D point clouds, You et al. (2019) construct node features by computing shortest-path distances to a set of distant anchor nodes, and Haonan et al. (2019) introduce a “star node” that stores global information about the graph. As mentioned before, this work is close in spirit to that of Xu et al. (2019) and Morris et al. (2019), which establish the correspondence between the WL test and GNNs.<br />
<br />
Regarding the results on the links between AC-GNNs and graded modal logic (Theorem 4.2), the very recent work of Sato et al. (2019) establishes close relationships between GNNs and certain classes of distributed local algorithms. These in turn have been shown to have strong correspondences with modal logics (Hella et al., 2015).<br />
<br />
== Conclusion ==<br />
The authors establish their claims with the help of ACR-GNNs. The results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice, e.g., Deng et al. (2018) use global-context-aware local descriptors to classify objects in 3D point clouds.<br />
As future work, the authors would like to study how their results can be applied to extract logical formulas from GNNs as possible explanations for their computations.<br />
<br />
== Critiques==<br />
<br />
The paper successfully characterizes which Boolean node classifiers GNNs can express. It was released in 2019 and has already been cited 22 times. The content is well organized and the explanations are easy to follow for an average reader. The authors also discuss directions for future work. However, they could have provided more detail on how performance differs across classifiers.<br />
<br />
<br />
== References ==<br />
[1] Franz Baader and Carsten Lutz. Description logic. In Handbook of modal logic, pp. 757–819. North-Holland, 2007.<br />
<br />
[2] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. Patel-Schneider (eds.). The description logic handbook: theory, implementation, and applications. Cambridge University Press, 2003.<br />
<br />
[3] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018. URL http://arxiv.org/abs/1806.01261.<br />
<br />
[4] Jin-Yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.<br />
<br />
[5] Ting Chen, Song Bian, and Yizhou Sun. Are powerful graph neural nets necessary? A dissection on graph classification. CoRR, abs/1905.04579, 2019. URL https://arxiv.org/abs/1905.04579.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45658THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-22T08:41:51Z<p>A227jain: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Although GNNs have recently proven very effective in many applications, their theoretical properties are not yet well understood.<br />
<br />
The ability of GNNs to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test incrementally constructs a labeling of the nodes of each graph and then compares the labelings of two graphs: if they differ, the graphs are not isomorphic. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each node by combining it with the aggregation of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. AC-GNNs cannot distinguish nodes that receive the same WL label, and there are AC-GNNs that can reproduce the WL labeling. This characterization, however, does not settle which Boolean node classifiers (i.e., functions assigning true or false to every node of a graph) can be expressed by GNNs: it does not imply that AC-GNNs can capture every node classifier that is refined by the WL test. This work aims to determine which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
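<br />
The labeling step of the WL test can be illustrated with a minimal sketch. This is the usual 1-dimensional color refinement, written here under our own assumptions: the function name <code>wl_refine</code>, the adjacency-dictionary input, and the fixed number of rounds are illustrative choices, not taken from the paper.<br />

```python
from collections import Counter


def wl_refine(adj, colors, rounds):
    """Run `rounds` iterations of 1-dimensional WL colour refinement.

    adj    -- dict mapping each node to a list of its neighbours
    colors -- dict mapping each node to an initial (sortable) colour
    """
    labels = dict(colors)
    for _ in range(rounds):
        # A node's signature is its own label plus the multiset of
        # labels seen among its neighbours.
        signatures = {
            v: (labels[v], tuple(sorted(Counter(labels[u] for u in adj[v]).items())))
            for v in adj
        }
        # Rename signatures to small integers so labels stay compact.
        palette = {sig: k for k, sig in enumerate(sorted(set(signatures.values())))}
        labels = {v: palette[signatures[v]] for v in adj}
    return labels


# Path a - b - c where every node starts with the same colour.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
labels = wl_refine(adj, {v: "red" for v in adj}, rounds=2)
print(labels)
```

On this path the two endpoints end up with the same label while the middle node gets a different one, because only the middle node has two neighbors.<br />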
<br />
== Introduction ==<br />
The authors tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs, which they call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors.<br />
<br />
Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple: there are many FOC2 node classifiers that cannot be expressed by AC-GNNs (e.g., one that checks whether some node anywhere in the graph, not necessarily a neighbor, has a given color). This raises two natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that can express all FOC2 classifiers? The paper answers both questions.<br />
<br />
The following are the main contributions:<br />
<br />
'''1.''' They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000), or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
<br />
'''2.''' Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts: in each layer they also compute a feature vector for the whole graph and combine it with the local aggregations. They call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that each FOC2 formula can be captured by an ACR-GNN.<br />
<br />
They experimentally validate their findings, showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when the networks learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
<br />
== Architecture ==<br />
This paper concentrates on the problem of Boolean node classification: given a (simple, undirected) graph <math>G = (V, E)</math> in which each vertex <math>v \in V</math> has an associated feature vector <math>x_v</math>, the aim is to classify each node as true or false. The feature vectors are assumed to be one-hot encodings of node colors drawn from a finite set of colors. The neighborhood <math>N_G(v)</math> of a node <math>v \in V</math> is the set <math>\{u \mid \{v, u\} \in E\}</math>. The basic architecture for GNNs, and the one considered in recent studies of GNN expressibility (Morris et al., 2019; Xu et al., 2019), consists of a sequence of layers that combine the feature vector of every node with the multiset of feature vectors of its neighbors. Formally, let AGG and COM be two sets of aggregation and combination functions, one of each per layer. An aggregate-combine GNN (AC-GNN) computes vectors <math>x_v^{(i)}</math> for every node v of the graph G, via the recursive formula<br />
<br />
[[File:a227-formula.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
where each <math>x_v^{(0)}</math> is the initial feature vector <math>x_v</math> of v. Finally, each node v of G is classified according to a Boolean classification function CLS applied to <math>x_v^{(L)}</math>, the feature vector after the last layer L.<br />
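<br />
The aggregate-combine recursion can be sketched in a few lines of plain Python. This is only an illustration under our own assumptions: the helper names <code>vec_sum</code> and <code>ac_gnn</code>, the use of a sum as the aggregation, and the hand-written <code>combine</code> function (a hypothetical layer for "red with at least one blue neighbor") are not taken from the paper.<br />

```python
def vec_sum(vectors, dim):
    """Sum a list of vectors of length `dim` (zero vector if the list is empty)."""
    total = [0.0] * dim
    for vec in vectors:
        for k, value in enumerate(vec):
            total[k] += value
    return total


def ac_gnn(adj, feats, layers, classify):
    """Apply aggregate-combine layers, then classify every node.

    layers   -- list of (aggregate, combine) pairs, one per layer
    classify -- Boolean function applied to each final feature vector
    """
    for aggregate, combine in layers:
        dim = len(next(iter(feats.values())))
        # Each node only sees its own vector and its neighbours' vectors.
        feats = {
            v: combine(feats[v], aggregate([feats[u] for u in adj[v]], dim))
            for v in adj
        }
    return {v: classify(x) for v, x in feats.items()}


# Colours are one-hot encoded as [red, blue].
RED, BLUE = [1.0, 0.0], [0.0, 1.0]
adj = {"a": ["b"], "b": ["a"], "c": []}
feats = {"a": RED, "b": BLUE, "c": RED}


def combine(x, agg):
    # Hypothetical layer: 1 if the node is red and has >= 1 blue neighbour.
    return [1.0 if x[0] == 1.0 and agg[1] >= 1.0 else 0.0]


result = ac_gnn(adj, feats, [(vec_sum, combine)], classify=lambda x: x[0] >= 1.0)
print(result)
```

Only node <code>a</code> is classified as true here: it is red and adjacent to the blue node <code>b</code>, whereas <code>c</code> is red but isolated.<br />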
<br />
== Concepts ==<br />
'''1. LOGICAL NODE CLASSIFIER'''<br />
The study relates the power of GNNs to that of classifiers expressed in first-order (FO) predicate logic over (undirected) graphs in which each vertex has exactly one color. The authors call these logical classifiers.<br />
<br />
'''2. LOGIC FOC2'''<br />
The logic FOC2 allows formulas that use all FO constructs together with counting quantifiers, but restricts them to only two variables. In terms of logical expressiveness, FOC2 is strictly less expressive than FO (counting quantifiers can always be mimicked in FO by using more variables and disequalities), but strictly more expressive than FO2, the fragment of FO that allows formulas to use only two variables (the paper exhibits a formula β(x) that belongs to FOC2 but not to FO2).<br />
<br />
'''3. FOC2 AND AC-GNN CLASSIFIER'''<br />
Two nodes are declared indistinguishable by the WL test if and only if they are indistinguishable by all FOC2 classifiers (Proposition 3.2), and nodes that the WL test cannot distinguish cannot be distinguished by AC-GNNs either (Proposition 2.1). This by no means implies, however, that every FOC2 classifier can be expressed as an AC-GNN. The next section addresses this question.<br />
<br />
== THE EXPRESSIVE POWER OF AC-GNNS ==<br />
AC-GNNs capture any FOC2 classifier as long as the formulas are further restricted to satisfy a locality property. This is a well-known restriction of FOC2 that corresponds to graded modal logic (de Rijke, 2000), which is fundamental for knowledge representation. The idea of graded modal logic is to force all subformulas to be guarded by the edge predicate E. This means that one cannot express in graded modal logic arbitrary formulas of the form ∃y ϕ(y), i.e., whether there is some node in the graph that satisfies property ϕ. Instead, one may only check whether some neighbor y of the node x at which the formula is being evaluated satisfies ϕ. That is, one can express the formula ∃y (E(x, y) ∧ ϕ(y)), because here ϕ(y) is guarded by E(x, y).<br />
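<br />
The difference between a guarded and an unguarded quantifier can be made concrete with a small sketch. The function names below are our own illustrative choices; the point is only that the guarded check reads the neighbors of x, while the unguarded one reads the whole graph.<br />

```python
def count_neighbours(adj, x, phi):
    """Number of neighbours y of x for which phi(y) holds."""
    return sum(1 for y in adj[x] if phi(y))


def guarded_at_least(adj, x, n, phi):
    """Guarded counting quantifier: at least n neighbours of x satisfy phi."""
    return count_neighbours(adj, x, phi) >= n


def exists_anywhere(adj, phi):
    """Unguarded quantifier: some node of the graph satisfies phi."""
    return any(phi(y) for y in adj)


color = {"a": "red", "b": "blue", "c": "red", "d": "blue"}
adj = {"a": ["b"], "b": ["a"], "c": [], "d": []}


def is_blue(y):
    return color[y] == "blue"


# Guarded: only looks at neighbours, so the answer differs between a and c.
print(guarded_at_least(adj, "a", 1, is_blue), guarded_at_least(adj, "c", 1, is_blue))
# Unguarded: looks at the whole graph, so the answer is the same for every node.
print(exists_anywhere(adj, is_blue))
```

The isolated node <code>c</code> fails the guarded check even though blue nodes exist elsewhere in the graph, which is exactly the kind of information a purely local update cannot reach.<br />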
<br />
The relationship between AC-GNNs and graded modal logic goes further: the authors show that graded modal logic is the “largest” class of logical classifiers captured by AC-GNNs. In other words, the only FO formulas that AC-GNNs can capture are those expressible in graded modal logic.<br />
<br />
Formally, their theorem (Theorem 4.2) states that a logical classifier is captured by AC-GNNs if and only if it can be expressed in graded modal logic. This holds no matter which aggregate and combine operators are considered, i.e., it is a limitation of the AC-GNN architecture, not of the specific functions chosen to update the features.<br />
<br />
== Concepts ==<br />
The main shortcoming of AC-GNNs for expressing FOC2 classifiers outside graded modal logic is their local behavior. A natural way to break this behavior is to allow a global feature computation on each layer of the GNN. This is called a global attribute computation in the framework of Battaglia et al. (2018). Following the recent GNN literature (Gilmer et al., 2017; Morris et al., 2019; Xu et al., 2019), the authors refer to this global operation as a readout. Formally, an aggregate-combine-readout GNN (ACR-GNN) extends AC-GNNs by specifying readout functions <math>READ^{(i)}</math>, which aggregate the current feature vectors of all the nodes in a graph.<br />
Then, the vector <math>x_v^{(i)}</math> of each node v in G on each layer i is computed by the following formula:<br />
<br />
[[File:a227-formula-final.png|700px|center|Image: 700 pixels]]<br />
<br />
Intuitively, every layer in an ACR-GNN first computes (i.e., “reads out”) the aggregation over all the nodes in G; then, for every node v, it computes the aggregation over the neighbors of v; and finally it combines the features of v with the two aggregation vectors.<br />
<br />
The paper considers an FOC2 classifier γ(x) that AC-GNNs cannot capture because it refers to nodes that are not neighbors of x. However, using a single readout plus local aggregations, one can implement this classifier as follows. First, let B denote the property “having at least 2 blue neighbors”. Then an ACR-GNN that implements γ(x) can (1) use one aggregation to store in the local feature of every node whether the node satisfies B, then (2) use a readout function to count how many nodes satisfying B exist in the whole graph, and (3) use another local aggregation to count how many neighbors of every node satisfy B.<br />
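<br />
The three steps can be sketched as plain counting over an adjacency dictionary. This is an illustration under our own assumptions: the function name <code>three_step_counts</code> is hypothetical, and since the exact formula for γ(x) is not reproduced in this summary, the sketch stops at the counts and leaves out the final comparison that defines γ.<br />

```python
def three_step_counts(adj, color):
    # (1) Local aggregation: does each node have at least 2 blue neighbours?
    has_b = {v: sum(color[u] == "blue" for u in adj[v]) >= 2 for v in adj}
    # (2) Readout: how many nodes in the whole graph satisfy B?
    total_b = sum(has_b.values())
    # (3) Local aggregation: how many neighbours of each node satisfy B?
    neighbours_b = {v: sum(has_b[u] for u in adj[v]) for v in adj}
    # Derived from (2) and (3): nodes satisfying B that are not neighbours
    # of v (this count includes v itself when v satisfies B).
    non_neighbours_b = {v: total_b - neighbours_b[v] for v in adj}
    return has_b, total_b, neighbours_b, non_neighbours_b


# h has two blue neighbours (b1, b2) and one red neighbour n; x is isolated.
adj = {"h": ["b1", "b2", "n"], "b1": ["h"], "b2": ["h"], "n": ["h"], "x": []}
color = {"h": "red", "b1": "blue", "b2": "blue", "n": "red", "x": "red"}

has_b, total_b, neighbours_b, non_neighbours_b = three_step_counts(adj, color)
print(has_b, total_b, neighbours_b, non_neighbours_b)
```

Subtracting the neighbor count from the global count is what gives each node information about the part of the graph it is not connected to; the isolated node <code>x</code> learns that a node satisfying B exists even though it has no neighbors.<br />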
<br />
The authors next show that just one readout is enough. However, this reduction in the number of readouts comes at the cost of severely complicating the resulting GNN. Formally, an aggregate-combine GNN with final readout (AC-FR-GNN) is obtained by using any number of layers as in the AC-GNN definition, together with a final layer that uses a readout function.<br />
<br />
== Experiments ==<br />
The authors performed experiments with synthetic data to empirically validate their results. They ran two sets of experiments: one showing that ACR-GNNs can learn a very simple FOC2 node classifier that AC-GNNs cannot learn, and one involving more complex FOC2 classifiers that need more intermediate readouts to be learned. Besides simple AC-GNNs, they also tested the GIN network proposed by Xu et al. (2019), adapting the implementation by Fey & Lenssen (2019) to classify nodes. The experiments use synthetic graphs with five initial colors encoded as one-hot features, divided into three sets: a training set of 5k graphs with 50 to 100 nodes each, a test set of 500 graphs of similar size to the training set, and a second test set of 500 graphs larger than those in the training set. They tried several configurations of the aggregation, combination and readout functions, and report the accuracy of the best configuration. Accuracy is computed as the number of correctly classified nodes divided by the total number of nodes across all graphs in the dataset. In every case they train for up to 20 epochs with the Adam optimizer.<br />
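<br />
The accuracy measure described above pools nodes over all graphs rather than averaging per-graph accuracies. A minimal sketch, assuming predictions and labels are given as one dictionary per graph (the function name <code>node_accuracy</code> and this data layout are our own assumptions):<br />

```python
def node_accuracy(predictions, labels):
    """Fraction of correctly classified nodes, pooled over all graphs.

    predictions, labels -- lists with one dict (node -> bool) per graph
    """
    correct = 0
    total = 0
    for pred, gold in zip(predictions, labels):
        for v in gold:
            correct += int(pred[v] == gold[v])
            total += 1
    return correct / total


# One small graph classified perfectly, one larger graph mostly wrong.
labels = [{"a": True}, {"p": True, "q": False, "r": True}]
predictions = [{"a": True}, {"p": True, "q": True, "r": False}]
print(node_accuracy(predictions, labels))
```

Because nodes are pooled, larger graphs weigh more: here 2 of the 4 nodes are correct, whereas averaging the two per-graph accuracies would give a higher figure.<br />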
<br />
[[File:a227_table1.png|600px|center|Image: 600 pixels]]<br />
<br />
[[File:a227_table2.png|560px|center|Image: 560 pixels]]<br />
<br />
For both types of graphs (line-shaped and random), a single-layer ACR-GNN already showed perfect performance (ACR-1 in Table 1). The authors expected this given the simplicity of the property being checked. In contrast, AC-GNNs and GINs (shown in Table 1 as AC-L and GIN-L, denoting AC-GNNs and GINs with L layers) struggled to fit the data. On line-shaped graphs, they could not fit the training data even with 7 layers. On random graphs, performance with 7 layers was considerably better.<br />
<br />
<br />
=== Final Remarks ===<br />
<br />
The paper's results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice: Deng et al. (2018) use global-context-aware local descriptors to classify objects in 3D point clouds, You et al. (2019) construct node features by computing shortest-path distances to a set of distant anchor nodes, and Haonan et al. (2019) introduce the idea of a “star node” that stores global information of the graph. As mentioned before, the authors' work is close in spirit to that of Xu et al. (2019) and Morris et al. (2019), which establishes the correspondence between the WL test and GNNs.<br />
<br />
Regarding the results on the links between AC-GNNs and graded modal logic (Theorem 4.2), the very recent work of Sato et al. (2019) establishes close relationships between GNNs and certain classes of distributed local algorithms. These in turn have been shown to have strong correspondences with modal logics (Hella et al., 2015).<br />
<br />
== Conclusion ==<br />
The authors establish their claims: AC-GNNs capture exactly the logical classifiers expressible in graded modal logic, while ACR-GNNs capture every FOC2 classifier. The results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice, e.g., Deng et al. (2018) use global-context-aware local descriptors to classify objects in 3D point clouds.<br />
As future work, the authors would like to study how their results can be applied to extract logical formulas from GNNs as possible explanations for their computations.<br />
<br />
== Critiques==<br />
<br />
The paper successfully characterizes which Boolean node classifiers GNNs can express. It was released in 2019 and has already been cited 22 times. The content is well organized and the explanations are easy to follow for an average reader. The authors also discuss directions for future work. However, they could have provided more detail on how performance differs across classifiers.<br />
<br />
<br />
== References ==<br />
[1] Franz Baader and Carsten Lutz. Description logic. In Handbook of modal logic, pp. 757–819. North-Holland, 2007.<br />
<br />
[2] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. Patel-Schneider (eds.). The description logic handbook: theory, implementation, and applications. Cambridge University Press, 2003.<br />
<br />
[3] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018. URL http://arxiv.org/abs/1806.01261.<br />
<br />
[4] Jin-Yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.<br />
<br />
[5] Ting Chen, Song Bian, and Yizhou Sun. Are powerful graph neural nets necessary? A dissection on graph classification. CoRR, abs/1905.04579, 2019. URL https://arxiv.org/abs/1905.04579.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45657THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-22T08:40:57Z<p>A227jain: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Although GNNs have recently proven very effective in many applications, their theoretical properties are not yet well understood.<br />
<br />
The ability of GNNs to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test incrementally constructs a labeling of the nodes of each graph and then compares the labelings of two graphs: if they differ, the graphs are not isomorphic. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each node by combining it with the aggregation of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. AC-GNNs cannot distinguish nodes that receive the same WL label, and there are AC-GNNs that can reproduce the WL labeling. This characterization, however, does not settle which Boolean node classifiers (i.e., functions assigning true or false to every node of a graph) can be expressed by GNNs: it does not imply that AC-GNNs can capture every node classifier that is refined by the WL test. This work aims to determine which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
<br />
== Introduction ==<br />
The authors tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs, which they call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors.<br />
<br />
Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple: there are many FOC2 node classifiers that cannot be expressed by AC-GNNs (e.g., one that checks whether some node anywhere in the graph, not necessarily a neighbor, has a given color). This raises two natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that can express all FOC2 classifiers? The paper answers both questions.<br />
<br />
The following are the main contributions:<br />
<br />
'''1.''' They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000), or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
<br />
'''2.''' Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts: in each layer they also compute a feature vector for the whole graph and combine it with the local aggregations. They call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that each FOC2 formula can be captured by an ACR-GNN.<br />
<br />
They experimentally validate their findings, showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when the networks learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
<br />
== Architecture ==<br />
This paper concentrates on the problem of Boolean node classification: given a (simple, undirected) graph <math>G = (V, E)</math> in which each vertex <math>v \in V</math> has an associated feature vector <math>x_v</math>, the aim is to classify each node as true or false. The feature vectors are assumed to be one-hot encodings of node colors drawn from a finite set of colors. The neighborhood <math>N_G(v)</math> of a node <math>v \in V</math> is the set <math>\{u \mid \{v, u\} \in E\}</math>. The basic architecture for GNNs, and the one considered in recent studies of GNN expressibility (Morris et al., 2019; Xu et al., 2019), consists of a sequence of layers that combine the feature vector of every node with the multiset of feature vectors of its neighbors. Formally, let AGG and COM be two sets of aggregation and combination functions, one of each per layer. An aggregate-combine GNN (AC-GNN) computes vectors <math>x_v^{(i)}</math> for every node v of the graph G, via the recursive formula<br />
<br />
[[File:a227-formula.png|500px|center|Image: 500 pixels]]<br />
<br />
<br />
where each <math>x_v^{(0)}</math> is the initial feature vector <math>x_v</math> of v. Finally, each node v of G is classified according to a Boolean classification function CLS applied to <math>x_v^{(L)}</math>, the feature vector after the last layer L.<br />
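<br />
For readers who cannot load the image above, the recursion can also be written out in text. This is a transcription under the assumption that the image shows the usual aggregate-combine update, with one aggregation function <math>AGG^{(i)}</math> and one combination function <math>COM^{(i)}</math> per layer, and with double braces denoting a multiset:<br />
<br />
<math>x_v^{(i)} = COM^{(i)}\left(x_v^{(i-1)},\ AGG^{(i)}\left(\{\!\{ x_u^{(i-1)} \mid u \in N_G(v) \}\!\}\right)\right), \quad i = 1, \ldots, L</math><br />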
<br />
== Concepts ==<br />
'''1. LOGICAL NODE CLASSIFIER'''<br />
The study relates the power of GNNs to that of classifiers expressed in first-order (FO) predicate logic over (undirected) graphs in which each vertex has exactly one color. The authors call these logical classifiers.<br />
<br />
'''2. LOGIC FOC2'''<br />
The logic FOC2 allows formulas that use all FO constructs together with counting quantifiers, but restricts them to only two variables. In terms of logical expressiveness, FOC2 is strictly less expressive than FO (counting quantifiers can always be mimicked in FO by using more variables and disequalities), but strictly more expressive than FO2, the fragment of FO that allows formulas to use only two variables (the paper exhibits a formula β(x) that belongs to FOC2 but not to FO2).<br />
<br />
'''3. FOC2 AND AC-GNN CLASSIFIER'''<br />
Two nodes are declared indistinguishable by the WL test if and only if they are indistinguishable by all FOC2 classifiers (Proposition 3.2), and nodes that the WL test cannot distinguish cannot be distinguished by AC-GNNs either (Proposition 2.1). This by no means implies, however, that every FOC2 classifier can be expressed as an AC-GNN. The next section addresses this question.<br />
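<br />
As a small worked example of the remark in item 2 that counting quantifiers can be mimicked in FO with more variables and disequalities, consider "there are at least two nodes satisfying ϕ". The notation <math>\exists^{\geq 2}</math> for the counting quantifier is an assumption here, since this summary does not show the paper's own notation:<br />
<br />
<math>\exists^{\geq 2} y\, \varphi(y) \;\equiv\; \exists y\, \exists z\, \left( y \neq z \wedge \varphi(y) \wedge \varphi(z) \right)</math><br />
<br />
The right-hand side needs one extra variable and one disequality; counting up to N in the same way needs N variables, which is why a two-variable fragment without counting quantifiers cannot simply absorb them.<br />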
<br />
== THE EXPRESSIVE POWER OF AC-GNNS ==<br />
AC-GNNs capture any FOC2 classifier as long as the formulas are further restricted to satisfy a locality property. This is a well-known restriction of FOC2 that corresponds to graded modal logic (de Rijke, 2000), which is fundamental for knowledge representation. The idea of graded modal logic is to force all subformulas to be guarded by the edge predicate E. This means that one cannot express in graded modal logic arbitrary formulas of the form ∃y ϕ(y), i.e., whether there is some node in the graph that satisfies property ϕ. Instead, one may only check whether some neighbor y of the node x at which the formula is being evaluated satisfies ϕ. That is, one can express the formula ∃y (E(x, y) ∧ ϕ(y)), because here ϕ(y) is guarded by E(x, y).<br />
<br />
The relationship between AC-GNNs and graded modal logic goes further: the authors show that graded modal logic is the “largest” class of logical classifiers captured by AC-GNNs. In other words, the only FO formulas that AC-GNNs can capture are those expressible in graded modal logic.<br />
<br />
Formally, their theorem (Theorem 4.2) states that a logical classifier is captured by AC-GNNs if and only if it can be expressed in graded modal logic. This holds no matter which aggregate and combine operators are considered, i.e., it is a limitation of the AC-GNN architecture, not of the specific functions chosen to update the features.<br />
<br />
== Concepts ==<br />
The main shortcoming of AC-GNNs for expressing FOC2 classifiers outside graded modal logic is their local behavior. A natural way to break this behavior is to allow a global feature computation on each layer of the GNN. This is called a global attribute computation in the framework of Battaglia et al. (2018). Following the recent GNN literature (Gilmer et al., 2017; Morris et al., 2019; Xu et al., 2019), the authors refer to this global operation as a readout. Formally, an aggregate-combine-readout GNN (ACR-GNN) extends AC-GNNs by specifying readout functions <math>READ^{(i)}</math>, which aggregate the current feature vectors of all the nodes in a graph.<br />
Then, the vector <math>x_v^{(i)}</math> of each node v in G on each layer i is computed by the following formula:<br />
<br />
[[File:a227-formula-final.png|500px|center|Image: 500 pixels]]<br />
<br />
Intuitively, every layer in an ACR-GNN first computes (i.e., “reads out”) the aggregation over all the nodes in G; then, for every node v, it computes the aggregation over the neighbors of v; and finally it combines the features of v with the two aggregation vectors.<br />
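<br />
The three steps described in this paragraph correspond to the following update, written out here for readers who cannot load the image. This is a transcription under the assumption that the image shows the usual aggregate-combine-readout rule, with double braces denoting a multiset:<br />
<br />
<math>x_v^{(i)} = COM^{(i)}\left(x_v^{(i-1)},\ AGG^{(i)}\left(\{\!\{ x_u^{(i-1)} \mid u \in N_G(v) \}\!\}\right),\ READ^{(i)}\left(\{\!\{ x_u^{(i-1)} \mid u \in G \}\!\}\right)\right)</math><br />
<br />
The only difference from the AC-GNN update is the third argument of the combination function, which is the same for every node of the graph.<br />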
<br />
The paper considers an FOC2 classifier γ(x) that AC-GNNs cannot capture because it refers to nodes that are not neighbors of x. However, using a single readout plus local aggregations, one can implement this classifier as follows. First, let B denote the property “having at least 2 blue neighbors”. Then an ACR-GNN that implements γ(x) can (1) use one aggregation to store in the local feature of every node whether the node satisfies B, then (2) use a readout function to count how many nodes satisfying B exist in the whole graph, and (3) use another local aggregation to count how many neighbors of every node satisfy B.<br />
<br />
The authors next show that just one readout is enough. However, this reduction in the number of readouts comes at the cost of severely complicating the resulting GNN. Formally, an aggregate-combine GNN with final readout (AC-FR-GNN) is obtained by using any number of layers as in the AC-GNN definition, together with a final layer that uses a readout function.<br />
<br />
== Experiments ==<br />
The authors performed experiments with synthetic data to empirically validate their results. They ran two sets of experiments: one showing that ACR-GNNs can learn a very simple FOC2 node classifier that AC-GNNs cannot learn, and one involving more complex FOC2 classifiers that need more intermediate readouts to be learned. Besides simple AC-GNNs, they also tested the GIN network proposed by Xu et al. (2019), adapting the implementation by Fey & Lenssen (2019) to classify nodes. The experiments use synthetic graphs with five initial colors encoded as one-hot features, divided into three sets: a training set of 5k graphs with 50 to 100 nodes each, a test set of 500 graphs of similar size to the training set, and a second test set of 500 graphs larger than those in the training set. They tried several configurations of the aggregation, combination and readout functions, and report the accuracy of the best configuration. Accuracy is computed as the number of correctly classified nodes divided by the total number of nodes across all graphs in the dataset. In every case they train for up to 20 epochs with the Adam optimizer.<br />
<br />
[[File:a227_table1.png|500px|center|Image: 500 pixels]]<br />
<br />
[[File:a227_table2.png|500px|center|Image: 500 pixels]]<br />
<br />
For both types of graphs (line-shaped and random), a single-layer ACR-GNN already showed perfect performance (ACR-1 in Table 1). The authors expected this given the simplicity of the property being checked. In contrast, AC-GNNs and GINs (shown in Table 1 as AC-L and GIN-L, denoting AC-GNNs and GINs with L layers) struggled to fit the data. On line-shaped graphs, they could not fit the training data even with 7 layers. On random graphs, performance with 7 layers was considerably better.<br />
<br />
<br />
=== Final Remarks ===<br />
<br />
The paper's results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice: Deng et al. (2018) use global-context-aware local descriptors to classify objects in 3D point clouds, You et al. (2019) construct node features by computing shortest-path distances to a set of distant anchor nodes, and Haonan et al. (2019) introduce the idea of a “star node” that stores global information of the graph. As mentioned before, the authors' work is close in spirit to that of Xu et al. (2019) and Morris et al. (2019), which establishes the correspondence between the WL test and GNNs.<br />
<br />
Regarding the results on the links between AC-GNNs and graded modal logic (Theorem 4.2), the very recent work of Sato et al. (2019) establishes close relationships between GNNs and certain classes of distributed local algorithms. These in turn have been shown to have strong correspondences with modal logics (Hella et al., 2015).<br />
<br />
== Conclusion ==<br />
The authors were successful in establishing their claims with the help of ACR-GNNs. The results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice; for example, Deng et al. (2018) use global-context aware local descriptors to classify objects in 3D point clouds.<br />
As future work, the authors would like to study how their results can be applied to extracting logical formulas from GNNs as possible explanations for their computations.<br />
<br />
== Critiques==<br />
<br />
The paper succeeds in characterizing which Boolean node classifiers GNNs can express. It was released in 2019 and has already been cited 22 times. The content is well organized and the explanations are easy for an average reader to follow. The authors also discuss future work and possible extensions. However, they could have said more about how performance differs across the classifiers tested.<br />
<br />
<br />
== References ==<br />
[1] Franz Baader and Carsten Lutz. Description logic. In Handbook of modal logic, pp. 757–819. North-Holland, 2007.<br />
<br />
[2] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. Patel-Schneider (eds.). The description logic handbook: theory, implementation, and applications. Cambridge University Press, 2003.<br />
<br />
[3] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018. URL http://arxiv.org/abs/1806.01261.<br />
<br />
[4] Jin-Yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.<br />
<br />
[5] Ting Chen, Song Bian, and Yizhou Sun. Are powerful graph neural nets necessary? A dissection on graph classification. CoRR, abs/1905.04579, 2019. URL https://arxiv.org/abs/1905.04579.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45656THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-22T08:39:59Z<p>A227jain: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Although GNNs have recently proven very efficient in many applications, their theoretical properties are not yet well understood.<br />
<br />
The ability of GNNs to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test works by constructing a labeling of the nodes of the graph, in an incremental fashion, and then decides whether two graphs are isomorphic by comparing the labelings of the two graphs. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each graph node by combining it with the aggregation of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. If the WL test assigns the same label to two nodes, then no AC-GNN can distinguish them (Morris et al., 2019; Xu et al., 2019). Moreover, there are AC-GNNs that can reproduce the WL labeling. This characterization, however, does not settle which Boolean node classifiers (i.e., functions assigning true or false to every node) can be expressed by GNNs. In particular, it does not imply that AC-GNNs can capture every node classifier that is refined by the WL test. This work aims to determine which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
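<br />
The incremental labeling used by the WL test can be sketched as follows. This is a minimal illustration of 1-dimensional WL (color refinement) on a single graph; the function name <code>wl_refine</code> and the toy path graph are hypothetical and not taken from the paper.<br />
<br />
```python
def wl_refine(adj, colors, rounds):
    """Refine node labels for a fixed number of rounds.

    adj maps each node to a list of its neighbours; colors maps each node
    to an initial label. All initial labels must be mutually comparable.
    """
    labels = dict(colors)
    for _ in range(rounds):
        # A node's signature is its own label plus the sorted multiset
        # of its neighbours' labels.
        sigs = {
            v: (labels[v], tuple(sorted(labels[u] for u in adj[v])))
            for v in adj
        }
        # Compress each distinct signature to a small integer label.
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        labels = {v: palette[sigs[v]] for v in adj}
    return labels


# Toy example: a path 0 - 1 - 2 - 3 where every node starts with the same color.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
wl_labels = wl_refine(path, {v: "c" for v in path}, rounds=2)
print(wl_labels)
```
<br />
On this path the two endpoints end up with one label and the two inner nodes with another, so an AC-GNN cannot tell the endpoints apart from each other, nor the inner nodes from each other.<br />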
<br />
== Introduction ==<br />
They tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs, which they call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors.<br />
<br />
Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other hand, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple: there are many FOC2 node classifiers, including very simple ones, that cannot be expressed by AC-GNNs. This raises two natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that allows expressing all FOC2 classifiers? The paper provides answers to both questions. <br />
<br />
The following are the main contributions:<br />
<br />
1. They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000), or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
<br />
2. Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts, where in each layer they also compute a feature vector for the whole graph and combine it with local aggregations; they call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that each FOC2 formula can be captured by an ACR-GNN.<br />
<br />
They experimentally validate their findings, showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when the networks learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
<br />
== Architecture ==<br />
This paper concentrates on the problem of Boolean node classification: given a (simple, undirected) graph G = (V, E) in which each vertex v ∈ V has an associated feature vector <math>x_v</math>, the authors aim to classify each graph node as true or false. The feature vectors are assumed to be one-hot encodings of node colors, drawn from a finite set of colors. The neighborhood <math>N_G(v)</math> of a node v ∈ V is the set {u | {v, u} ∈ E}. The basic architecture for GNNs, and the one studied in recent work on GNN expressibility (Morris et al., 2019; Xu et al., 2019), consists of a sequence of layers that combine the feature vector of every node with the multiset of feature vectors of its neighbors. Formally, let AGG and COM be two sets of aggregation and combination functions, one of each per layer. An aggregate-combine GNN (AC-GNN) computes vectors <math>x_v^{(i)}</math> for every node v of the graph G, via the recursive formula<br />
<br />
[[File:a227-formula.png|500px|center|Image: 500 pixels]]<br />
<br />
<br />
where each <math>x_v^{(0)}</math> is the initial feature vector <math>x_v</math> of v. Finally, each node v of G is classified according to a Boolean classification function CLS applied to <math>x_v^{(L)}</math>, the vector computed by the last layer L.<br />
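<br />
The aggregate-combine update can be sketched in a few lines. This is a minimal illustration, assuming a component-wise sum as the aggregation function and a hand-written combination function; the names <code>ac_layer</code>, <code>combine</code> and the toy star graph are hypothetical, and the code is not the architecture used in the paper's experiments.<br />
<br />
```python
def ac_layer(adj, x, combine):
    """One aggregate-combine step: x_v <- COM(x_v, AGG({{x_u : u in N(v)}})).

    adj maps each node to a list of its neighbours; x maps each node to a
    feature vector (a list of numbers). AGG is fixed to a component-wise sum.
    """
    dim = len(next(iter(x.values())))
    new_x = {}
    for v in adj:
        # Aggregate the neighbours' feature vectors (sum of an empty set is 0).
        agg = [sum(x[u][k] for u in adj[v]) for k in range(dim)]
        new_x[v] = combine(x[v], agg)
    return new_x


# Toy star graph: node 0 has two blue neighbours (1, 2) and one red neighbour (3).
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
x0 = {0: [0, 1], 1: [1, 0], 2: [1, 0], 3: [0, 1]}  # one-hot [blue, red]

# Hypothetical COM: keep a single flag "has at least two blue neighbours".
x1 = ac_layer(adj, x0, lambda own, agg: [1.0 if agg[0] >= 2 else 0.0])

# CLS: a node is classified as true when its flag is set.
labels = {v: feats[0] >= 1.0 for v, feats in x1.items()}
print(labels)
```
<br />
Only node 0 is classified as true here. The property depends only on a node's neighbors, which is the kind of local condition a single AC-GNN layer can see.<br />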
<br />
== Concepts ==<br />
'''1. LOGICAL NODE CLASSIFIER'''<br />
Their study relates the power of GNNs to that of classifiers expressed in first-order (FO) predicate logic over (undirected) graphs where each vertex has a unique color (recall that they call these classifiers logical classifiers).<br />
<br />
'''2. LOGIC FOC2'''<br />
The logic FOC2 allows for formulas using all FO constructs and counting quantifiers, but restricted to only two variables. Note that, in terms of their logical expressiveness, FOC2 is strictly less expressive than FO (as counting quantifiers can always be mimicked in FO by using more variables and disequalities), but is strictly more expressive than FO2, the fragment of FO that allows formulas to use only two variables (as β(x) belongs to FOC2 but not to FO2).<br />
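<br />
As a small illustration of the remark about counting quantifiers (a standard equivalence, not a formula taken from this summary), the statement that at least two elements satisfy φ needs one quantified variable with a counting quantifier, whereas plain FO needs two variables and a disequality:<br />
<br />
<math>\exists^{\geq 2} y\, \varphi(y) \;\equiv\; \exists y_1 \exists y_2\, (y_1 \neq y_2 \wedge \varphi(y_1) \wedge \varphi(y_2))</math><br />
<br />
Mimicking <math>\exists^{\geq N}</math> in this way takes N pairwise distinct variables, which is why the translation leaves the two-variable fragment as N grows.<br />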
<br />
'''3. FOC2 AND AC-GNN CLASSIFIER'''<br />
While it is true that two nodes are declared indistinguishable by the WL test if and only if they are indistinguishable by all FOC2 classifiers (Proposition 3.2), and if the former holds then such nodes cannot be distinguished by AC-GNNs (Proposition 2.1), this by no means tells us that every FOC2 classifier can be expressed as an AC-GNN. The answer to this problem is covered in the next section.<br />
<br />
== THE EXPRESSIVE POWER OF AC-GNNS ==<br />
AC-GNNs can capture any FOC2 classifier as long as the formulas are further restricted to satisfy a locality property. This happens to be a well-known restriction of FOC2, and corresponds to graded modal logic (de Rijke, 2000), which is fundamental for knowledge representation. The idea of graded modal logic is to force all subformulas to be guarded by the edge predicate E. This means that one cannot express in graded modal logic arbitrary formulas of the form ∃yϕ(y), i.e., whether there is some node anywhere in the graph that satisfies property ϕ. Instead, one is allowed to check whether some neighbor y of the node x where the formula is being evaluated satisfies ϕ. That is, the formula ∃y (E(x, y) ∧ ϕ(y)) can be expressed in the logic, since in this case ϕ(y) is guarded by E(x, y).<br />
<br />
The relationship between AC-GNNs and graded modal logic goes further: they can show that graded modal logic is the “largest” class of logical classifiers captured by AC-GNNs. This means that the only FO formulas that AC-GNNs are able to learn accurately are those in graded modal logic.<br />
<br />
According to their theorem, a logical classifier is captured by AC-GNNs if and only if it can be expressed in graded modal logic. This holds no matter which aggregate and combine operators are considered, i.e., this is a limitation of the AC-GNN architecture, not of the specific functions that one chooses to update the features.<br />
<br />
== ACR-GNNs ==<br />
The main shortcoming of AC-GNNs for expressing all FOC2 classifiers is their local behavior. A natural way to break this behavior is to allow a global feature computation on each layer of the GNN. This is called a global attribute computation in the framework of Battaglia et al. (2018). Following the recent GNN literature (Gilmer et al., 2017; Morris et al., 2019; Xu et al., 2019), the authors refer to this global operation as a readout. Formally, an aggregate-combine-readout GNN (ACR-GNN) extends AC-GNNs by specifying readout functions <math>READ^{(i)}</math>, which aggregate the current feature vectors of all the nodes in a graph.<br />
Then, the vector <math>x_v^{(i)}</math> of each node v in G on each layer i is computed by the following formula:<br />
<br />
[[File:a227-formula-final.png|500px|center|Image: 500 pixels]]<br />
<br />
Intuitively, every layer in an ACR-GNN first computes (i.e., “reads out”) the aggregation over all the nodes in G; then, for every node v, it computes the aggregation over the neighbors of v; and finally it combines the features of v with the two aggregation vectors.<br />
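<br />
The three stages described above (readout, local aggregation, combination) can be sketched as follows. This is a minimal illustration, assuming sums for both the readout and the local aggregation; the names <code>acr_layer</code>, <code>combine</code> and the toy graph are hypothetical. The example classifier, "the node is red and some blue node exists anywhere in the graph", is chosen here only because it is not guarded by the edge predicate.<br />
<br />
```python
def acr_layer(adj, x, combine):
    """One aggregate-combine-readout step.

    adj maps each node to a list of its neighbours; x maps each node to a
    feature vector. Both READ and AGG are fixed to component-wise sums.
    """
    dim = len(next(iter(x.values())))
    # Readout: aggregate over every node of the graph, once per layer.
    readout = [sum(x[u][k] for u in adj) for k in range(dim)]
    new_x = {}
    for v in adj:
        # Local aggregation over the neighbours of v.
        agg = [sum(x[u][k] for u in adj[v]) for k in range(dim)]
        # Combine the node's own features with both aggregation vectors.
        new_x[v] = combine(x[v], agg, readout)
    return new_x


# Toy graph: two adjacent red nodes (0, 1) and an isolated blue node (2).
adj = {0: [1], 1: [0], 2: []}
x0 = {0: [0, 1], 1: [0, 1], 2: [1, 0]}  # one-hot [blue, red]

# Hypothetical COM: "this node is red and the graph contains a blue node".
x1 = acr_layer(
    adj, x0,
    lambda own, agg, read: [1.0 if own[1] == 1 and read[0] >= 1 else 0.0],
)
labels = {v: feats[0] >= 1.0 for v, feats in x1.items()}
print(labels)
```
<br />
Nodes 0 and 1 are classified as true even though the blue node is not connected to them. A layer that only aggregates over neighbors has no access to that information.<br />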
<br />
Consider the paper's example classifier γ(x), which AC-GNNs cannot capture. Using a single readout plus local aggregations, one can implement this classifier as follows. First, define by B the property “having at least 2 blue neighbors”. Then an ACR-GNN that implements γ(x) can (1) use one aggregation to store in the local feature of every node whether the node satisfies B, then (2) use a readout function to count how many nodes satisfying B exist in the whole graph, and (3) use another local aggregation to count how many neighbors of every node satisfy B.<br />
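<br />
The three steps can be traced on a small graph. This sketch computes the quantities directly rather than through trained layers; the function name <code>b_counts</code> and the toy graph are hypothetical, and the sketch does not reproduce the formula γ(x) itself.<br />
<br />
```python
def b_counts(adj, is_blue):
    """Compute the three quantities from steps (1)-(3).

    adj maps each node to a list of its neighbours; is_blue maps each node
    to True/False.
    """
    # (1) Local aggregation: does the node have at least 2 blue neighbours?
    sat_b = {v: sum(is_blue[u] for u in adj[v]) >= 2 for v in adj}
    # (2) Readout: how many nodes in the whole graph satisfy B?
    total_b = sum(sat_b.values())
    # (3) Local aggregation: how many neighbours of each node satisfy B?
    nbr_b = {v: sum(sat_b[u] for u in adj[v]) for v in adj}
    return sat_b, total_b, nbr_b


# Toy graph: nodes 1 and 2 are blue; nodes 0 and 3 are each adjacent to both.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2], 4: []}
is_blue = {0: False, 1: True, 2: True, 3: False, 4: False}

sat_b, total_b, nbr_b = b_counts(adj, is_blue)
print(sat_b, total_b, nbr_b)
```
<br />
With both counts stored at a node, a combination function can, for instance, take the difference <code>total_b - nbr_b[v]</code> to obtain the number of nodes satisfying B that are not neighbors of v, a quantity that local aggregation alone cannot provide.<br />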
<br />
They next show that just one readout is actually enough. However, this reduction in the number of readouts comes at the cost of severely complicating the resulting GNN. Formally, an aggregate-combine GNN with final readout (AC-FR-GNN) results from using any number of layers as in the AC-GNN definition, together with a final layer that uses a readout function.<br />
<br />
== Experiments ==<br />
The authors performed experiments with synthetic data to empirically validate their results. They ran two sets of experiments: one showing that ACR-GNNs can learn a very simple FOC2 node classifier that AC-GNNs cannot learn, and one involving more complex FOC2 classifiers that need more intermediate readouts to be learned. Besides testing simple AC-GNNs, they also tested the GIN network proposed by Xu et al. (2019), using the implementation by Fey & Lenssen (2019) adapted to classify nodes. Their experiments use synthetic graphs, with five initial colors encoded as one-hot features, divided into three sets: a train set with 5k graphs of size up to 50-100 nodes, a test set with 500 graphs of size similar to the train set, and another test set with 500 graphs larger than those in the train set. They tried several configurations for the aggregation, combination and readout functions, and report the accuracy of the best configuration. Accuracy is computed as the number of correctly classified nodes divided by the total number of nodes across all graphs in the dataset. In every case they ran up to 20 epochs with the Adam optimizer. <br />
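<br />
The accuracy measure described above pools nodes across graphs instead of averaging per-graph accuracies. A minimal sketch, with a hypothetical function name and made-up predictions used purely to show the arithmetic:<br />
<br />
```python
def node_accuracy(graphs):
    """Fraction of correctly classified nodes over all graphs.

    graphs is a list of (predictions, labels) pairs, one pair per graph,
    where both entries are equally long lists of per-node values.
    """
    correct = sum(
        p == y
        for preds, labels in graphs
        for p, y in zip(preds, labels)
    )
    total = sum(len(labels) for _, labels in graphs)
    return correct / total


# Made-up data: a 3-node graph with one mistake and a 1-node graph with none.
toy = [([1, 0, 1], [1, 1, 1]), ([0], [0])]
acc = node_accuracy(toy)
print(acc)
```
<br />
Here 3 of the 4 nodes are correct, so the pooled accuracy is 0.75, whereas the mean of the per-graph accuracies would be (2/3 + 1)/2. Larger graphs therefore weigh more under the pooled measure.<br />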
<br />
[[File:a227_table1.png|500px|center|Image: 500 pixels]]<br />
<br />
[[File:a227_table2.png|500px|center|Image: 500 pixels]]<br />
<br />
For both types of graphs, single-layer ACR-GNNs already showed perfect performance (ACR-1 in Table 1). This was expected given the simplicity of the property being checked. In contrast, AC-GNNs and GINs (shown in Table 1 as AC-L and GIN-L, representing AC-GNNs and GINs with L layers) struggle to fit the data. For line-shaped graphs, they were not able to fit the train data even with 7 layers. For random graphs, the performance with 7 layers was considerably better.<br />
<br />
<br />
=== Final Remarks ===<br />
<br />
The paper's results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice: Deng et al. (2018) use global-context aware local descriptors to classify objects in 3D point clouds, You et al. (2019) construct node features by computing shortest-path distances to a set of distant anchor nodes, and Haonan et al. (2019) introduce the idea of a “star node” that stores global information of the graph. As mentioned before, their work is close in spirit to that of Xu et al. (2019) and Morris et al. (2019), which establishes the correspondence between the WL test and GNNs.<br />
<br />
Regarding the results on the links between AC-GNNs and graded modal logic (Theorem 4.2), the very recent work of Sato et al. (2019) establishes close relationships between GNNs and certain classes of distributed local algorithms. These in turn have been shown to have strong correspondences with modal logics (Hella et al., 2015).<br />
<br />
== Conclusion ==<br />
The authors were successful in establishing their claims with the help of ACR-GNNs. The results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice; for example, Deng et al. (2018) use global-context aware local descriptors to classify objects in 3D point clouds.<br />
As future work, the authors would like to study how their results can be applied to extracting logical formulas from GNNs as possible explanations for their computations.<br />
<br />
== Critiques==<br />
<br />
The paper succeeds in characterizing which Boolean node classifiers GNNs can express. It was released in 2019 and has already been cited 22 times. The content is well organized and the explanations are easy for an average reader to follow. The authors also discuss future work and possible extensions. However, they could have said more about how performance differs across the classifiers tested.<br />
<br />
<br />
== References ==<br />
[1] Franz Baader and Carsten Lutz. Description logic. In Handbook of modal logic, pp. 757–819. North-Holland, 2007.<br />
<br />
[2] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. Patel-Schneider (eds.). The description logic handbook: theory, implementation, and applications. Cambridge University Press, 2003.<br />
<br />
[3] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018. URL http://arxiv.org/abs/1806.01261.<br />
<br />
[4] Jin-Yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.<br />
<br />
[5] Ting Chen, Song Bian, and Yizhou Sun. Are powerful graph neural nets necessary? A dissection on graph classification. CoRR, abs/1905.04579, 2019. URL https://arxiv.org/abs/1905.04579.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=45566stat940F212020-11-21T20:03:02Z<p>A227jain: Adding presentation video and summary link</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [https://youtu.be/epBzlXHFNlY Presentation]<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || [https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration Summary] ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization Summary]|| [https://youtu.be/b9MU5cc3-m0 Presentation Video]<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIR COMPARISON OF GRAPH NEURAL NETWORKS FOR GRAPH CLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification Summary] || [https://drive.google.com/file/d/1Dx6mFL_zBAJcfPQdOWAuPn0_HkvTL_0z/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates Summary] || [https://drive.google.com/file/d/1amkWrR8ZQKnnInjedRZ7jbXTqCA8Hy1r/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS Summary] || [https://drive.google.com/file/d/1mZVlF2UvJ2lGjuVcN5SYqBuO4jZjuCcU/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE Summary] || [https://youtu.be/5h-365TPQqE Presentation]<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations Summary] || Learn<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_Learning Summary] || [https://www.youtube.com/watch?v=WKUj30tgHfs&feature=youtu.be video]<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Generalization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features Summary]|| [https://youtu.be/djrJG6pJaL0 video] also available on Learn<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION Summary] ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Dense Passage Retrieval for Open-Domain Question Answering || [https://arxiv.org/abs/2004.04906 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering Summary] || Learn<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| AdaCompress: Adaptive Compression for Online Computer Vision Services || [https://arxiv.org/pdf/1909.08148.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adacompress:_Adaptive_compression_for_online_computer_vision_services Do Not Review Yet] ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||A critical analysis of self-supervision, or what we can learn from a single image|| [https://openreview.net/pdf?id=B1esx6EYvr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || [https://arxiv.org/pdf/1904.04232.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval Do Not Review Yet]|| Learn<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45537THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-21T18:24:04Z<p>A227jain: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Although GNNs have recently proven very efficient in many applications, their theoretical properties are not yet well understood.<br />
<br />
The ability of GNNs to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test works by constructing a labeling of the nodes of the graph, in an incremental fashion, and then decides whether two graphs are isomorphic by comparing the labelings of the two graphs. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each graph node by combining it with the aggregation of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. If the WL test assigns the same label to two nodes, then no AC-GNN can distinguish them (Morris et al., 2019; Xu et al., 2019). Moreover, there are AC-GNNs that can reproduce the WL labeling. This characterization, however, does not settle which Boolean node classifiers (i.e., functions assigning true or false to every node) can be expressed by GNNs. In particular, it does not imply that AC-GNNs can capture every node classifier that is refined by the WL test. The authors' work aims to determine which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
<br />
<br />
== Introduction ==<br />
They tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs, which they call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors.<br />
<br />
Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other hand, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple, and there are many FOC2 node classifiers (including trivial ones) that cannot be expressed by AC-GNNs. This leaves two natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that allows expressing all FOC2 classifiers? In this paper, they provide answers to these two questions. <br />
<br />
The following are the main contributions:<br />
<br />
'''1.''' They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000), or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
<br />
'''2.''' Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts, where in each layer they also compute a feature vector for the whole graph and combine it with local aggregations; they call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that each FOC2 formula can be captured by an ACR-GNN.<br />
<br />
They experimentally validate their findings, showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when the networks learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
<br />
== Architecture ==<br />
The paper concentrates on the problem of Boolean node classification: given a (simple, undirected) graph <math>G = (V, E)</math> in which each vertex <math>v \in V</math> has an associated feature vector <math>x_v</math>, the goal is to classify each graph node as true or false. The feature vectors are assumed to be one-hot encodings of node colors in the graph, from a finite set of colors. The neighborhood <math>N_G(v)</math> of a node <math>v \in V</math> is the set <math>\{u \mid \{v, u\} \in E\}</math>. The basic architecture for GNNs, and the one studied in recent work on GNN expressibility (Morris et al., 2019; Xu et al., 2019), consists of a sequence of layers that combine the feature vector of every node with the multiset of feature vectors of its neighbors. Formally, let AGG and COM be two sets of aggregation and combination functions. An aggregate-combine GNN (AC-GNN) computes vectors <math>x_v^{(i)}</math> for every node v of the graph G, via the recursive formula<br />
<br />
[[File:a227-formula.png|500px|center|Image: 500 pixels]]<br />
<br />
<br />
where each <math>x_v^{(0)}</math> is the initial feature vector <math>x_v</math> of v. Finally, each node v of G is classified according to a Boolean classification function CLS applied to <math>x_v^{(L)}</math>, where L is the number of layers.<br />
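<br />
The layer-by-layer update described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names <code>ac_gnn</code> and <code>sum_agg</code>, the choice of sum as the aggregation, list concatenation as the combination, and the example classifier are hypothetical and are not taken from the paper.<br />

```python
def ac_gnn(adj, features, layers, cls):
    """Apply a stack of aggregate-combine layers, then classify every node.

    adj      -- dict: node -> list of neighbors
    features -- dict: node -> initial feature vector (list of numbers)
    layers   -- list of (agg, com) pairs, one pair per layer
    cls      -- function mapping a final feature vector to True/False
    """
    x = dict(features)
    for agg, com in layers:
        # Each node is updated only from the previous layer's vectors.
        x = {v: com(x[v], agg([x[u] for u in adj[v]])) for v in adj}
    return {v: cls(x[v]) for v in adj}


def sum_agg(vectors, dim=2):
    """Component-wise sum of a multiset of vectors (zero vector if empty)."""
    total = [0] * dim
    for vec in vectors:
        for k, val in enumerate(vec):
            total[k] += val
    return total


RED, BLUE = [1, 0], [0, 1]          # one-hot color encodings
star = {"c": ["b1", "b2", "r"], "b1": ["c"], "b2": ["c"], "r": ["c"]}
colors = {"c": RED, "b1": BLUE, "b2": BLUE, "r": RED}

# One layer: concatenate own features with the summed neighbor features,
# then test "at least 2 blue neighbors" on the aggregated blue count.
result = ac_gnn(star, colors, [(sum_agg, lambda own, nb: own + nb)],
                cls=lambda x: x[3] >= 2)
```

In this example only the center node <code>c</code> is classified as true, since it is the only node with two blue neighbors.<br />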
<br />
== Concepts ==<br />
'''1. LOGICAL NODE CLASSIFIER'''<br />
The study relates the power of GNNs to that of classifiers expressed in first-order (FO) predicate logic over (undirected) graphs where each vertex has a unique color; these classifiers are called logical classifiers.<br />
<br />
'''2. LOGIC FOC2'''<br />
The logic FOC2 allows for formulas using all FO constructs and counting quantifiers, but restricted to only two variables. In terms of logical expressiveness, FOC2 is strictly less expressive than FO (as counting quantifiers can always be mimicked in FO by using more variables and disequalities), but strictly more expressive than FO2, the fragment of FO that allows formulas to use only two variables (as β(x) belongs to FOC2 but not to FO2).<br />
<br />
'''3. FOC2 AND AC-GNN CLASSIFIER'''<br />
While it is true that two nodes are declared indistinguishable by the WL test if and only if they are indistinguishable by all FOC2 classifiers (Proposition 3.2), and if the former holds then such nodes cannot be distinguished by AC-GNNs (Proposition 2.1), this by no means tells us that every FOC2 classifier can be expressed as an AC-GNN. The answer to this problem is covered in the next section.<br />
<br />
== THE EXPRESSIVE POWER OF AC-GNNS ==<br />
AC-GNNs capture any FOC2 classifier as long as the formulas are further restricted to satisfy a locality property. This happens to be a well-known restriction of FOC2, and corresponds to graded modal logic (de Rijke, 2000), which is fundamental for knowledge representation. The idea of graded modal logic is to force all subformulas to be guarded by the edge predicate E. This means that one cannot express in graded modal logic arbitrary formulas of the form ∃yϕ(y), i.e., whether there is some node that satisfies property ϕ. Instead, one is allowed to check whether some neighbor y of the node x where the formula is being evaluated satisfies ϕ. That is, one is allowed to express the formula ∃y (E(x, y) ∧ ϕ(y)) in the logic, as in this case ϕ(y) is guarded by E(x, y).<br />
<br />
The relationship between AC-GNNs and graded modal logic goes further: the authors show that graded modal logic is the “largest” class of logical classifiers captured by AC-GNNs. This means that the only FO formulas that AC-GNNs are able to learn accurately are those in graded modal logic.<br />
<br />
According to their theorem, a logical classifier is captured by AC-GNNs if and only if it can be expressed in graded modal logic. This holds no matter which aggregate and combine operators are considered, i.e., this is a limitation of the architecture for AC-GNNs, not of the specific functions that one chooses to update the features.<br />
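<br />
The difference between a guarded and an unguarded quantifier can be made concrete with a small sketch. The function names and the example graph below are hypothetical and chosen only for illustration; they do not come from the paper.<br />

```python
def exists_neighbor(adj, phi, x):
    """Guarded: ∃y (E(x, y) ∧ ϕ(y)). Only the neighbors of x are inspected."""
    return any(phi(y) for y in adj[x])


def at_least_n_neighbors(adj, phi, x, n):
    """Guarded counting quantifier: at least n neighbors of x satisfy ϕ."""
    return sum(1 for y in adj[x] if phi(y)) >= n


def exists_anywhere(adj, phi, x):
    """Unguarded: ∃y ϕ(y). Every node is inspected, whatever x is."""
    return any(phi(y) for y in adj)


# Nodes a and b are connected; node c is isolated and is the only blue node.
graph = {"a": ["b"], "b": ["a"], "c": []}
is_blue = lambda node: node == "c"

guarded = exists_neighbor(graph, is_blue, "a")    # looks only at b
unguarded = exists_anywhere(graph, is_blue, "a")  # also sees the distant c
```

Here the guarded formula is false at <code>a</code> while the unguarded one is true, because the only witness is outside the neighborhood of <code>a</code>. This is the kind of non-local information that the layers of an AC-GNN never receive.<br />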
<br />
== ACR-GNNs ==<br />
The main shortcoming of AC-GNNs for expressing FOC2 classifiers is their local behavior. A natural way to break this behavior is to allow for a global feature computation on each layer of the GNN. This is called a global attribute computation in the framework of Battaglia et al. (2018). Following the recent GNN literature (Gilmer et al., 2017; Morris et al., 2019; Xu et al., 2019), this global operation is referred to as a readout. Formally, an aggregate-combine-readout GNN (ACR-GNN) extends AC-GNNs by specifying readout functions <math>READ^{(i)}</math>, which aggregate the current feature vectors of all the nodes in a graph.<br />
Then, the vector <math>x_v^{(i)}</math> of each node v in G on each layer i is computed by the following formula:<br />
<br />
[[File:a227-formula-final.png|500px|center|Image: 500 pixels]]<br />
<br />
Intuitively, every layer in an ACR-GNN first computes (i.e., “reads out”) the aggregation over all the nodes in G; then, for every node v, it computes the aggregation over the neighbors of v; and finally it combines the features of v with the two aggregation vectors.<br />
<br />
AC-GNNs cannot capture the classifier γ(x). However, using a single readout plus local aggregations one can implement this classifier as follows. First, let B denote the property “having at least 2 blue neighbors”. Then an ACR-GNN that implements γ(x) can (1) use one aggregation to store in the local feature of every node whether the node satisfies B, then (2) use a readout function to count how many nodes satisfying B exist in the whole graph, and (3) use another local aggregation to count how many neighbors of every node satisfy B.<br />
<br />
The authors next show that just one readout is enough. However, this reduction in the number of readouts comes at the cost of severely complicating the resulting GNN. Formally, an aggregate-combine GNN with final readout (AC-FR-GNN) is obtained by using any number of layers as in the AC-GNN definition, together with a final layer that uses a readout function.<br />
<br />
== Experiments ==<br />
The authors perform experiments with synthetic data to empirically validate their results. They perform two sets of experiments: experiments to show that ACR-GNNs can learn a very simple FOC2 node classifier that AC-GNNs cannot learn, and experiments involving complex FOC2 classifiers that need more intermediate readouts to be learned. Besides testing simple AC-GNNs, they also test the GIN network proposed by Xu et al. (2019), using the implementation by Fey & Lenssen (2019) adapted to classify nodes. The experiments use synthetic graphs, with five initial colors encoded as one-hot features, divided into three sets: a train set with 5k graphs of 50 to 100 nodes, a test set with 500 graphs of size similar to the train set, and another test set with 500 graphs larger than those in the train set. Several configurations for the aggregation, combination and readout functions were tried, and the accuracy of the best configuration is reported. Accuracy is computed as the total number of nodes correctly classified among all nodes in all the graphs in the dataset. In every case, training runs for up to 20 epochs with the Adam optimizer. <br />
<br />
[[File:a227_table1.png|500px|center|Image: 500 pixels]]<br />
<br />
[[File:a227_table2.png|500px|center|Image: 500 pixels]]<br />
<br />
For both types of graphs (line-shaped and random), single-layer ACR-GNNs already showed perfect performance (ACR-1 in Table 1). This was expected given the simplicity of the property being checked. In contrast, AC-GNNs and GINs (shown in Table 1 as AC-L and GIN-L, representing AC-GNNs and GINs with L layers) struggle to fit the data. For the line-shaped graphs, they were not able to fit the train data even with 7 layers. For random graphs, the performance with 7 layers was considerably better.<br />
<br />
<br />
=== Final Remarks ===<br />
<br />
The results of this paper (published as a conference paper at ICLR 2020) show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice, e.g., Deng et al. (2018) use global-context aware local descriptors to classify objects in 3D point clouds, You et al. (2019) construct node features by computing shortest-path distances to a set of distant anchor nodes, and Haonan et al. (2019) introduced the idea of a “star node” that stores global information of the graph. The work is close in spirit to that of Xu et al. (2019) and Morris et al. (2019), which establish the correspondence between the WL test and GNNs.<br />
<br />
Regarding the results on the links between AC-GNNs and graded modal logic (Theorem 4.2), the authors point out that very recent work of Sato et al. (2019) establishes close relationships between GNNs and certain classes of distributed local algorithms. These in turn have been shown to have strong correspondences with modal logics (Hella et al., 2015).<br />
<br />
== Conclusion ==<br />
The authors establish their claims with the help of ACR-GNNs. The results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice, e.g., Deng et al. (2018) use global-context aware local descriptors to classify objects in 3D point clouds.<br />
As future work, the authors would like to study how their results can be applied to extracting logical formulas from GNNs as possible explanations for their computations.<br />
<br />
== Critiques==<br />
<br />
The paper has been quite successful in characterizing which Boolean node classifiers GNNs can express. The paper was released in 2019 and has already been cited 22 times. The content is well organized and the explanations are easy to follow for an average reader. The authors also discuss future work and possibilities. They could have given more detail about the performance differences across different classifiers.<br />
<br />
<br />
== References ==<br />
[1] Franz Baader and Carsten Lutz. Description logic. In Handbook of modal logic, pp. 757–819. North-Holland, 2007.<br />
<br />
[2] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. Patel-Schneider (eds.). The description logic handbook: theory, implementation, and applications. Cambridge University Press, 2003.<br />
<br />
[3] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018. URL http://arxiv.org/abs/1806.01261.<br />
<br />
[4] Jin-Yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.<br />
<br />
[5] Ting Chen, Song Bian, and Yizhou Sun. Are powerful graph neural nets necessary? A dissection on graph classification. CoRR, abs/1905.04579, 2019. URL https://arxiv.org/abs/1905.04579.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45536THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-21T18:22:54Z<p>A227jain: Adding my presentation details</p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Despite the fact that GNNs have recently been proven very efficient in many applications, their theoretical properties are not yet well-understood.<br />
<br />
The ability of GNNs to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test works by constructing a labeling of the nodes of a graph in an incremental fashion, and then decides whether two graphs are isomorphic by comparing their labelings. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each graph node by combining it with the aggregation of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. If the WL test assigns the same label to two nodes, then no AC-GNN can distinguish them; moreover, there are AC-GNNs that can reproduce the WL labeling. This characterization, however, does not settle the issue of which Boolean node classifiers (i.e., functions classifying nodes in graphs as true or false) can be expressed by GNNs: it does not imply that AC-GNNs can capture every node classifier that is refined by the WL test. The authors' work aims to answer the question of which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
<br />
<br />
== Introduction ==<br />
They tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs, which they call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors.<br />
<br />
Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other hand, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple, and there are many FOC2 node classifiers (including trivial ones) that cannot be expressed by AC-GNNs. This leaves two natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that allows expressing all FOC2 classifiers? In this paper, they provide answers to these two questions. <br />
<br />
The following are the main contributions:<br />
<br />
'''1.''' They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000), or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
<br />
'''2.''' Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts, where in each layer they also compute a feature vector for the whole graph and combine it with local aggregations; they call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that each FOC2 formula can be captured by an ACR-GNN.<br />
<br />
They experimentally validate their findings, showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when the networks learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
<br />
== Architecture ==<br />
The paper concentrates on the problem of Boolean node classification: given a (simple, undirected) graph <math>G = (V, E)</math> in which each vertex <math>v \in V</math> has an associated feature vector <math>x_v</math>, the goal is to classify each graph node as true or false. The feature vectors are assumed to be one-hot encodings of node colors in the graph, from a finite set of colors. The neighborhood <math>N_G(v)</math> of a node <math>v \in V</math> is the set <math>\{u \mid \{v, u\} \in E\}</math>. The basic architecture for GNNs, and the one studied in recent work on GNN expressibility (Morris et al., 2019; Xu et al., 2019), consists of a sequence of layers that combine the feature vector of every node with the multiset of feature vectors of its neighbors. Formally, let AGG and COM be two sets of aggregation and combination functions. An aggregate-combine GNN (AC-GNN) computes vectors <math>x_v^{(i)}</math> for every node v of the graph G, via the recursive formula<br />
<br />
[[File:a227-formula.png]]<br />
<br />
<br />
where each <math>x_v^{(0)}</math> is the initial feature vector <math>x_v</math> of v. Finally, each node v of G is classified according to a Boolean classification function CLS applied to <math>x_v^{(L)}</math>, where L is the number of layers.<br />
<br />
== Concepts ==<br />
'''1. LOGICAL NODE CLASSIFIER'''<br />
The study relates the power of GNNs to that of classifiers expressed in first-order (FO) predicate logic over (undirected) graphs where each vertex has a unique color; these classifiers are called logical classifiers.<br />
<br />
'''2. LOGIC FOC2'''<br />
The logic FOC2 allows for formulas using all FO constructs and counting quantifiers, but restricted to only two variables. In terms of logical expressiveness, FOC2 is strictly less expressive than FO (as counting quantifiers can always be mimicked in FO by using more variables and disequalities), but strictly more expressive than FO2, the fragment of FO that allows formulas to use only two variables (as β(x) belongs to FOC2 but not to FO2).<br />
<br />
'''3. FOC2 AND AC-GNN CLASSIFIER'''<br />
While it is true that two nodes are declared indistinguishable by the WL test if and only if they are indistinguishable by all FOC2 classifiers (Proposition 3.2), and if the former holds then such nodes cannot be distinguished by AC-GNNs (Proposition 2.1), this by no means tells us that every FOC2 classifier can be expressed as an AC-GNN. The answer to this problem is covered in the next section.<br />
<br />
== THE EXPRESSIVE POWER OF AC-GNNS ==<br />
AC-GNNs capture any FOC2 classifier as long as the formulas are further restricted to satisfy a locality property. This happens to be a well-known restriction of FOC2, and corresponds to graded modal logic (de Rijke, 2000), which is fundamental for knowledge representation. The idea of graded modal logic is to force all subformulas to be guarded by the edge predicate E. This means that one cannot express in graded modal logic arbitrary formulas of the form ∃yϕ(y), i.e., whether there is some node that satisfies property ϕ. Instead, one is allowed to check whether some neighbor y of the node x where the formula is being evaluated satisfies ϕ. That is, one is allowed to express the formula ∃y (E(x, y) ∧ ϕ(y)) in the logic, as in this case ϕ(y) is guarded by E(x, y).<br />
<br />
The relationship between AC-GNNs and graded modal logic goes further: the authors show that graded modal logic is the “largest” class of logical classifiers captured by AC-GNNs. This means that the only FO formulas that AC-GNNs are able to learn accurately are those in graded modal logic.<br />
<br />
According to their theorem, a logical classifier is captured by AC-GNNs if and only if it can be expressed in graded modal logic. This holds no matter which aggregate and combine operators are considered, i.e., this is a limitation of the architecture for AC-GNNs, not of the specific functions that one chooses to update the features.<br />
<br />
== ACR-GNNs ==<br />
The main shortcoming of AC-GNNs for expressing FOC2 classifiers is their local behavior. A natural way to break this behavior is to allow for a global feature computation on each layer of the GNN. This is called a global attribute computation in the framework of Battaglia et al. (2018). Following the recent GNN literature (Gilmer et al., 2017; Morris et al., 2019; Xu et al., 2019), this global operation is referred to as a readout. Formally, an aggregate-combine-readout GNN (ACR-GNN) extends AC-GNNs by specifying readout functions <math>READ^{(i)}</math>, which aggregate the current feature vectors of all the nodes in a graph.<br />
Then, the vector <math>x_v^{(i)}</math> of each node v in G on each layer i is computed by the following formula:<br />
<br />
[[File:a227-formula-final.png]]<br />
<br />
Intuitively, every layer in an ACR-GNN first computes (i.e., “reads out”) the aggregation over all the nodes in G; then, for every node v, it computes the aggregation over the neighbors of v; and finally it combines the features of v with the two aggregation vectors.<br />
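<br />
One such layer can be sketched as follows. This is a minimal illustration under assumptions: the names <code>acr_layer</code> and <code>vec_sum</code>, the use of sums for both the aggregation and the readout, and the example classifier ("red node, and some blue node exists anywhere in the graph") are hypothetical choices made here, not the authors' implementation.<br />

```python
def vec_sum(vectors, dim):
    """Component-wise sum of a multiset of vectors (zero vector if empty)."""
    total = [0] * dim
    for vec in vectors:
        for k, val in enumerate(vec):
            total[k] += val
    return total


def acr_layer(adj, x, agg, read, com):
    """One aggregate-combine-readout layer.

    read is applied once to the vectors of all nodes, agg is applied per
    node to the vectors of its neighbors, and com merges the node's own
    vector with the local and the global aggregation.
    """
    global_vec = read([x[v] for v in adj])  # identical for every node
    return {
        v: com(x[v], agg([x[u] for u in adj[v]]), global_vec)
        for v in adj
    }


# Two isolated nodes with one-hot colors [red, blue]: a is red, b is blue.
graph = {"a": [], "b": []}
x0 = {"a": [1, 0], "b": [0, 1]}

summed = lambda vectors: vec_sum(vectors, 2)
# Output 1 exactly for red nodes when at least one blue node exists anywhere.
com = lambda own, local, glob: [1 if own[0] == 1 and glob[1] >= 1 else 0]
x1 = acr_layer(graph, x0, agg=summed, read=summed, com=com)
```

Node <code>a</code> has no neighbors, so a purely local aggregation gives it no information about <code>b</code>; the readout vector is what makes the blue node visible to it.<br />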
<br />
AC-GNNs cannot capture the classifier γ(x). However, using a single readout plus local aggregations one can implement this classifier as follows. First, let B denote the property “having at least 2 blue neighbors”. Then an ACR-GNN that implements γ(x) can (1) use one aggregation to store in the local feature of every node whether the node satisfies B, then (2) use a readout function to count how many nodes satisfying B exist in the whole graph, and (3) use another local aggregation to count how many neighbors of every node satisfy B.<br />
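<br />
The three steps can be traced on a small graph with plain Python. This sketch only computes the three quantities named in the text; the function name <code>b_counts</code> and the example graph are hypothetical, and how the quantities are finally combined depends on the definition of γ(x), which is not reproduced in this summary.<br />

```python
def b_counts(adj, is_blue):
    """Return, per node, (satisfies_B, total_B_in_graph, B_neighbors).

    B is the property "having at least 2 blue neighbors".
    """
    # (1) local aggregation: does the node have at least 2 blue neighbors?
    sat_b = {v: sum(is_blue[u] for u in adj[v]) >= 2 for v in adj}
    # (2) readout: how many nodes in the whole graph satisfy B?
    total_b = sum(sat_b.values())
    # (3) second local aggregation: how many neighbors of v satisfy B?
    nbr_b = {v: sum(sat_b[u] for u in adj[v]) for v in adj}
    return {v: (sat_b[v], total_b, nbr_b[v]) for v in adj}


# p is adjacent to two blue nodes; q is isolated.
graph = {"p": ["b1", "b2"], "b1": ["p"], "b2": ["p"], "q": []}
is_blue = {"p": False, "b1": True, "b2": True, "q": False}
counts = b_counts(graph, is_blue)
```

In this graph only <code>p</code> satisfies B. The difference between the global count from step (2) and the neighbor count from step (3) tells each node how many nodes satisfying B lie outside its neighborhood (counting the node itself if it satisfies B), which is information that no purely local aggregation provides.<br />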
<br />
The authors next show that just one readout is enough. However, this reduction in the number of readouts comes at the cost of severely complicating the resulting GNN. Formally, an aggregate-combine GNN with final readout (AC-FR-GNN) is obtained by using any number of layers as in the AC-GNN definition, together with a final layer that uses a readout function.<br />
<br />
== Experiments ==<br />
The authors perform experiments with synthetic data to empirically validate their results. They perform two sets of experiments: experiments to show that ACR-GNNs can learn a very simple FOC2 node classifier that AC-GNNs cannot learn, and experiments involving complex FOC2 classifiers that need more intermediate readouts to be learned. Besides testing simple AC-GNNs, they also test the GIN network proposed by Xu et al. (2019), using the implementation by Fey & Lenssen (2019) adapted to classify nodes. The experiments use synthetic graphs, with five initial colors encoded as one-hot features, divided into three sets: a train set with 5k graphs of 50 to 100 nodes, a test set with 500 graphs of size similar to the train set, and another test set with 500 graphs larger than those in the train set. Several configurations for the aggregation, combination and readout functions were tried, and the accuracy of the best configuration is reported. Accuracy is computed as the total number of nodes correctly classified among all nodes in all the graphs in the dataset. In every case, training runs for up to 20 epochs with the Adam optimizer. <br />
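<br />
The accuracy measure described above pools nodes across all graphs rather than averaging per-graph accuracies. A minimal sketch, assuming predictions and labels are available as per-graph lists of booleans (the function name <code>node_accuracy</code> and this data layout are assumptions made for illustration):<br />

```python
def node_accuracy(graphs):
    """Fraction of correctly classified nodes, pooled over all graphs.

    graphs -- iterable of (predictions, labels) pairs; each element of a
              pair is a list of booleans with one entry per node.
    """
    correct = 0
    total = 0
    for predictions, labels in graphs:
        correct += sum(p == y for p, y in zip(predictions, labels))
        total += len(labels)
    return correct / total


# A 2-node graph with one error and an 8-node graph with no errors:
# 9 of the 10 nodes are classified correctly.
dataset = [
    ([True, False], [True, True]),
    ([True] * 8, [True] * 8),
]
accuracy = node_accuracy(dataset)
```

With this pooled definition, large graphs weigh more than small ones; averaging the two per-graph accuracies in the example would instead give 0.75.<br />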
<br />
[[File:a227_table1.png]]<br />
<br />
[[File:a227_table2.png]]<br />
<br />
For both types of graphs (line-shaped and random), single-layer ACR-GNNs already showed perfect performance (ACR-1 in Table 1). This was expected given the simplicity of the property being checked. In contrast, AC-GNNs and GINs (shown in Table 1 as AC-L and GIN-L, representing AC-GNNs and GINs with L layers) struggle to fit the data. For the line-shaped graphs, they were not able to fit the train data even with 7 layers. For random graphs, the performance with 7 layers was considerably better.<br />
<br />
<br />
=== Final Remarks ===<br />
<br />
The results of this paper (published as a conference paper at ICLR 2020) show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice, e.g., Deng et al. (2018) use global-context aware local descriptors to classify objects in 3D point clouds, You et al. (2019) construct node features by computing shortest-path distances to a set of distant anchor nodes, and Haonan et al. (2019) introduced the idea of a “star node” that stores global information of the graph. The work is close in spirit to that of Xu et al. (2019) and Morris et al. (2019), which establish the correspondence between the WL test and GNNs.<br />
<br />
Regarding the results on the links between AC-GNNs and graded modal logic (Theorem 4.2), the authors point out that very recent work of Sato et al. (2019) establishes close relationships between GNNs and certain classes of distributed local algorithms. These in turn have been shown to have strong correspondences with modal logics (Hella et al., 2015).<br />
<br />
== Conclusion ==<br />
The authors establish their claims with the help of ACR-GNNs. The results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice, e.g., Deng et al. (2018) use global-context aware local descriptors to classify objects in 3D point clouds.<br />
As future work, the authors would like to study how their results can be applied to extracting logical formulas from GNNs as possible explanations for their computations.<br />
<br />
== Critiques==<br />
<br />
The paper has been quite successful in characterizing which Boolean node classifiers GNNs can express. The paper was released in 2019 and has already been cited 22 times. The content is well organized and the explanations are easy to follow for an average reader. The authors also discuss future work and possibilities. They could have given more detail about the performance differences across different classifiers.<br />
<br />
<br />
== References ==<br />
[1] Franz Baader and Carsten Lutz. Description logic. In Handbook of modal logic, pp. 757–819. North-Holland, 2007.<br />
<br />
[2] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. Patel-Schneider (eds.). The description logic handbook: theory, implementation, and applications. Cambridge University Press, 2003.<br />
<br />
[3] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018. URL http://arxiv.org/abs/1806.01261.<br />
<br />
[4] Jin-Yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.<br />
<br />
[5] Ting Chen, Song Bian, and Yizhou Sun. Are powerful graph neural nets necessary? A dissection on graph classification. CoRR, abs/1905.04579, 2019. URL https://arxiv.org/abs/1905.04579.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45516THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-21T17:32:13Z<p>A227jain: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Despite the fact that GNNs have recently been proven very efficient in many applications, their theoretical properties are not yet well-understood.<br />
<br />
The ability of GNNs to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test works by constructing a labeling of the nodes of a graph in an incremental fashion, and then decides whether two graphs are isomorphic by comparing their labelings. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each graph node by combining it with the aggregation of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. If the WL test assigns the same label to two nodes, then no AC-GNN can distinguish them; moreover, there are AC-GNNs that can reproduce the WL labeling. This characterization, however, does not settle the issue of which Boolean node classifiers (i.e., functions classifying nodes in graphs as true or false) can be expressed by GNNs: it does not imply that AC-GNNs can capture every node classifier that is refined by the WL test. The authors' work aims to answer the question of which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
<br />
<br />
== Introduction ==<br />
They tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs, which they call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors.<br />
<br />
Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other hand, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple, and there are many FOC2 node classifiers (e.g., the trivial one above) that cannot be expressed by AC-GNNs. This leaves us with the following natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that allows expressing all FOC2 classifiers? In this paper, they provide answers to these two questions. <br />
<br />
The following are the main contributions:<br />
<br />
'''1.''' They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000), or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
<br />
'''2.''' Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts: in each layer, they also compute a feature vector for the whole graph and combine it with the local aggregations. They call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that each FOC2 formula can be captured by an ACR-GNN.<br />
<br />
They experimentally validate their findings, showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when the networks learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
<br />
== Architecture ==<br />
The paper concentrates on the problem of Boolean node classification: given a (simple, undirected) graph <math>G = (V, E)</math> in which each vertex <math>v \in V</math> has an associated feature vector <math>x_v</math>, the goal is to classify each node as true or false. The feature vectors are assumed to be one-hot encodings of node colors, drawn from a finite set of colors. The neighborhood <math>N_G(v)</math> of a node <math>v \in V</math> is the set <math>\{u \mid \{v, u\} \in E\}</math>. The basic architecture for GNNs, and the one studied in recent work on GNN expressibility (Morris et al., 2019; Xu et al., 2019), consists of a sequence of layers that combine the feature vector of every node with the multiset of feature vectors of its neighbors. Formally, let <math>\{AGG^{(i)}\}_{i=1}^{L}</math> and <math>\{COM^{(i)}\}_{i=1}^{L}</math> be two sets of aggregation and combination functions. An aggregate-combine GNN (AC-GNN) computes vectors <math>x_v^{(i)}</math> for every node v of the graph G, via the recursive formula<br />
<br />
[[File:a227-formula.png]]<br />
<br />
<br />
where each <math>x_v^{(0)}</math> is the initial feature vector <math>x_v</math> of v. Finally, each node v of G is classified according to a Boolean classification function CLS applied to <math>x_v^{(L)}</math>.<br />
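<br />
One layer of the aggregate-combine update can be sketched as follows. This is a minimal illustration, assuming sum aggregation over neighbors and a combination function that applies a truncated ReLU to linear maps; the function name <code>ac_gnn_layer</code> and the parameters <code>C</code>, <code>A</code>, and <code>b</code> are hypothetical and only stand in for one possible choice of AGG and COM.<br />

```python
import numpy as np

def ac_gnn_layer(X, adj, C, A, b):
    """One aggregate-combine layer (illustrative choice of AGG and COM).

    X:   (n, d) matrix whose row v is the current feature vector of node v
    adj: (n, n) 0/1 adjacency matrix of the undirected graph
    C:   (d, k) weights applied to a node's own features
    A:   (d, k) weights applied to the aggregated neighbor features
    b:   (k,)   bias
    """
    agg = adj @ X                                   # AGG: sum over neighbors
    return np.clip(X @ C + agg @ A + b, 0.0, 1.0)   # COM: truncated ReLU

# Path 0 - 1 - 2; node 0 is red, nodes 1 and 2 are blue (one-hot: [red, blue]).
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
C = np.zeros((2, 1))
A = np.array([[1.0], [0.0]])   # count red neighbors only
b = np.zeros(1)
has_red_neighbor = ac_gnn_layer(X, adj, C, A, b)
```

With these weights, the single output feature is 1 exactly for nodes that have at least one red neighbor, which here is only node 1.<br />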
<br />
== Concepts ==<br />
'''1. LOGICAL NODE CLASSIFIER'''<br />
Their study relates the power of GNNs to that of classifiers expressed in first-order (FO) predicate logic over (undirected) graphs in which each vertex has a unique color; such classifiers are called logical classifiers.<br />
<br />
'''2. LOGIC FOC2'''<br />
The logic FOC2 allows for formulas using all FO constructs and counting quantifiers, but restricted to only two variables. Note that, in terms of their logical expressiveness, we have that FOC2 is strictly less expressive than FO (as counting quantifiers can always be mimicked in FO by using more variables and disequalities), but is strictly more expressive than FO2, the fragment of FO that allows formulas to use only two variables (as β(x) belongs to FOC2 but not to FO2).<br />
<br />
'''3. FOC2 AND AC-GNN CLASSIFIER'''<br />
While it is true that two nodes are declared indistinguishable by the WL test if and only if they are indistinguishable by all FOC2 classifiers (Proposition 3.2), and if the former holds then such nodes cannot be distinguished by AC-GNNs (Proposition 2.1), this by no means tells us that every FOC2 classifier can be expressed as an AC-GNN. The answer to this problem is covered in the next section.<br />
<br />
== THE EXPRESSIVE POWER OF AC-GNNS ==<br />
AC-GNNs capture any FOC2 classifier as long as the formulas are further restricted to satisfy a locality property. This is a well-known restriction of FOC2 and corresponds to graded modal logic (de Rijke, 2000), which is fundamental for knowledge representation. The idea of graded modal logic is to force all subformulas to be guarded by the edge predicate E. This means that one cannot express in graded modal logic arbitrary formulas of the form ∃y ϕ(y), i.e., whether there is some node that satisfies property ϕ. Instead, one is allowed to check whether some neighbor y of the node x where the formula is being evaluated satisfies ϕ. That is, the formula ∃y (E(x, y) ∧ ϕ(y)) is expressible in the logic because ϕ(y) is guarded by E(x, y).<br />
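<br />
The difference between a guarded and an unguarded quantifier can be made concrete with a small sketch. The helper names <code>at_least_n_neighbors</code> and <code>exists_node</code>, as well as the example graph, are hypothetical and chosen only to illustrate the two kinds of formula.<br />

```python
def at_least_n_neighbors(adj, phi, x, n=1):
    """Guarded (graded modal) check: x has at least n neighbors y with phi(y)."""
    return sum(1 for y in adj[x] if phi(y)) >= n

def exists_node(adj, phi):
    """Unguarded check: some node y anywhere in the graph satisfies phi(y)."""
    return any(phi(y) for y in adj)

# Two components: the edge 0 - 1, and an isolated red node 2.
adj = {0: [1], 1: [0], 2: []}
red = {2}
is_red = lambda y: y in red

local_answer = at_least_n_neighbors(adj, is_red, 0)   # only looks at neighbors of 0
global_answer = exists_node(adj, is_red)              # looks at the whole graph
```

Here the guarded check at node 0 is false while the unguarded one is true, because the only red node is not reachable from node 0 through an edge.<br />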
<br />
The relationship between AC-GNNs and graded modal logic goes further: we can show that graded modal logic is the “largest” class of logical classifiers captured by AC-GNNs. This means that the only FO formulas that AC-GNNs are able to learn accurately are those in graded modal logic.<br />
<br />
According to their theorem, a logical classifier is captured by AC-GNNs if and only if it can be expressed in graded modal logic. This holds no matter which aggregate and combine operators are considered, i.e., this is a limitation of the AC-GNN architecture, not of the specific functions that one chooses to update the features.<br />
<br />
== Concepts ==<br />
The main shortcoming of AC-GNNs for expressing FOC2 classifiers beyond graded modal logic is their local behavior. A natural way to break this behavior is to allow for a global feature computation on each layer of the GNN. This is called a global attribute computation in the framework of Battaglia et al. (2018). Following the recent GNN literature (Gilmer et al., 2017; Morris et al., 2019; Xu et al., 2019), this global operation is referred to as a readout. Formally, an aggregate-combine-readout GNN (ACR-GNN) extends AC-GNNs by specifying readout functions <math>READ^{(i)}</math>, which aggregate the current feature vectors of all the nodes in a graph.<br />
Then, the vector <math>x_v^{(i)}</math> of each node v in G on each layer i is computed by the following formula:<br />
<br />
[[File:a227-formula-final.png]]<br />
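<br />
The effect of adding a readout can be sketched in the same style as an aggregate-combine layer. This is a minimal illustration, assuming sum aggregation, a sum readout over all nodes, and a truncated ReLU over linear maps; the name <code>acr_gnn_layer</code> and the parameters <code>C</code>, <code>A</code>, <code>R</code>, and <code>b</code> are hypothetical.<br />

```python
import numpy as np

def acr_gnn_layer(X, adj, C, A, R, b):
    """One aggregate-combine-readout layer (illustrative choice of functions).

    X:   (n, d) node features, adj: (n, n) 0/1 adjacency matrix
    C, A, R: (d, k) weights for own features, neighbor sum, and global readout
    b:   (k,) bias
    """
    agg = adj @ X                              # local: sum over neighbors
    readout = X.sum(axis=0, keepdims=True)     # global: sum over all nodes
    return np.clip(X @ C + agg @ A + readout @ R + b, 0.0, 1.0)

# Edge 0 - 1 plus an isolated red node 2 (one-hot: [red, blue]).
X = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
adj = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
zeros = np.zeros((2, 1))
red = np.array([[1.0], [0.0]])
b = np.zeros(1)

local_only = acr_gnn_layer(X, adj, zeros, red, zeros, b)    # red seen via neighbors
with_readout = acr_gnn_layer(X, adj, zeros, zeros, red, b)  # red seen via readout
```

Using only the neighbor term, no node detects the red node because it is isolated; using the readout term, every node does, which corresponds to an unguarded formula of the form ∃y Red(y).<br />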
<br />
== Experiments ==<br />
The authors used two experiments to show that their attack on certified models can break these defenses. The datasets used for both experiments were CIFAR-10 and ImageNet.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.<br />
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also natural images (larger radii means a stronger/more confident certificate) </div><br />
<br />
The third and fifth columns correspond to the mean radius of the certified region of the original images and the mean radius of the spoof certificate of the perturbed images, respectively. The mean radius of the certificate of the adversarial images was greater than that of the original images. This indicates that the 'Shadow Attack' approach was successful in creating spoof certificates of greater radius and with the wrong label, and therefore in breaking the certified defense.<br />
<br />
=== Attack on CROWN-IBP ===<br />
CROWN-IBP is an adversarial defense against <math>\ell_\infty</math>-norm bounded attacks. The same approach was applied to the CROWN-IBP defense, and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2 :''' “Robust error” for natural images, and “attack error” for Shadow Attack images using the
CIFAR-10 dataset and CROWN-IBP models. Smaller is better. </div><br />
<br />
<br />
The above table shows the robustness errors in the case of the CROWN-IBP method and the attack images. It is seen that the errors in the case of the attack were less than the equivalent errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\text{inf}</math>-norm certified defenses as well.<br />
<br />
== Conclusion ==<br />
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques==<br />
<br />
The paper suggests that formulating defenses and certifications mathematically through an <math> l_{p} </math>-norm constraint, as assumed in the attack's optimization problem, is a weak approach. The top models cannot achieve certifications beyond a disturbance of <math> \epsilon = 0.3 </math> in the <math> l_{2} </math> norm, while disturbances of <math> \epsilon = 4 </math> added to the target input are barely noticeable to the human eye, and images perturbed with <math> \epsilon = 100 </math> are still easily classified by humans as belonging to the same class. As discussed by many authors, human perception of multi-dimensional space goes beyond what the <math> l_{p} </math> norm is capable of capturing and synthesizing. More comprehensive metrics and algorithms have yet to be proposed that capture the correlation between pixels of an image or input data and better convey to optimization algorithms how humans distinguish features of an input image. Such a metric would give the optimization algorithms better intuition about the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
[1] Xu, H., Ma, Y., Liu, H. C., Deb, D., Liu, H., Tang, J. L., & Jain, A. K. (2020). Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. International Journal of Automation and Computing, 17(2), 151–178.<br />
<br />
[2] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
[3] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
[4] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.<br />
<br />
[5] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:a227-formula-final.png&diff=45511File:a227-formula-final.png2020-11-21T17:24:57Z<p>A227jain: a227-formula-final.png</p>
<hr />
<div>a227-formula-final.png</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:a227_table2.png&diff=45509File:a227 table2.png2020-11-21T17:23:56Z<p>A227jain: </p>
<hr />
<div></div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:a227_table1.png&diff=45508File:a227 table1.png2020-11-21T17:23:29Z<p>A227jain: table 1</p>
<hr />
<div>table 1</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45497THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-21T16:49:28Z<p>A227jain: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Despite the fact that GNNs have recently been proven very efficient in many applications, their theoretical properties are not yet well-understood.<br />
<br />
The ability of GNNs to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test incrementally constructs a labeling of the nodes of each graph and then decides whether two graphs are isomorphic by comparing their labelings. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each node by combining it with the aggregation of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. If the WL test assigns two nodes the same label, then no AC-GNN can distinguish them; moreover, there are AC-GNNs that can reproduce the WL labeling. This characterization, however, does not settle which Boolean node classifiers (i.e., functions classifying each node of a graph as true or false) can be expressed by GNNs: it does not imply that AC-GNNs can capture every node classifier that is refined by the WL test. The authors' work aims to determine which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
<br />
<br />
== Introduction ==<br />
They tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs, which they call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors.<br />
<br />
Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other hand, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple, and there are many FOC2 node classifiers (e.g., the trivial one above) that cannot be expressed by AC-GNNs. This leaves us with the following natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that allows expressing all FOC2 classifiers? In this paper, they provide answers to these two questions. <br />
<br />
The following are the main contributions:<br />
<br />
'''1.''' They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000), or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
<br />
'''2.''' Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts: in each layer, they also compute a feature vector for the whole graph and combine it with the local aggregations. They call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that each FOC2 formula can be captured by an ACR-GNN.<br />
<br />
They experimentally validate their findings, showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when the networks learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
<br />
== Architecture ==<br />
The paper concentrates on the problem of Boolean node classification: given a (simple, undirected) graph <math>G = (V, E)</math> in which each vertex <math>v \in V</math> has an associated feature vector <math>x_v</math>, the goal is to classify each node as true or false. The feature vectors are assumed to be one-hot encodings of node colors, drawn from a finite set of colors. The neighborhood <math>N_G(v)</math> of a node <math>v \in V</math> is the set <math>\{u \mid \{v, u\} \in E\}</math>. The basic architecture for GNNs, and the one studied in recent work on GNN expressibility (Morris et al., 2019; Xu et al., 2019), consists of a sequence of layers that combine the feature vector of every node with the multiset of feature vectors of its neighbors. Formally, let <math>\{AGG^{(i)}\}_{i=1}^{L}</math> and <math>\{COM^{(i)}\}_{i=1}^{L}</math> be two sets of aggregation and combination functions. An aggregate-combine GNN (AC-GNN) computes vectors <math>x_v^{(i)}</math> for every node v of the graph G, via the recursive formula<br />
<br />
[[File:a227-formula.png]]<br />
<br />
<br />
where each <math>x_v^{(0)}</math> is the initial feature vector <math>x_v</math> of v. Finally, each node v of G is classified according to a Boolean classification function CLS applied to <math>x_v^{(L)}</math>.<br />
<br />
== Concepts ==<br />
'''1. LOGICAL NODE CLASSIFIER'''<br />
Their study relates the power of GNNs to that of classifiers expressed in first-order (FO) predicate logic over (undirected) graphs in which each vertex has a unique color; such classifiers are called logical classifiers.<br />
<br />
'''2. LOGIC FOC2<br />
The logic FOC2 allows for formulas using all FO constructs and counting quantifiers, but restricted to only two variables. Note that, in terms of their logical expressiveness, we have that FOC2 is strictly less expressive than FO (as counting quantifiers can always be mimicked in FO by using more variables and disequalities), but is strictly more expressive than FO2, the fragment of FO that allows formulas to use only two variables (as β(x) belongs to FOC2 but not to FO2).<br />
<br />
'''3. FOC2 AND AC-GNN CLASSIFIER<br />
While it is true that two nodes are declared indistinguishable by the WL test if and only if they are indistinguishable by all FOC2 classifiers (Proposition 3.2), and if the former holds then such nodes cannot be distinguished by AC-GNNs (Proposition 2.1), this by no means tells us that every FOC2 classifier can be expressed as an AC-GNN. The answer to this problem is covered in the next section.<br />
<br />
== THE EXPRESSIVE POWER OF AC-GNNS ==<br />
AC-GNNs capture any FOC2 classifier as long as the formulas are further restricted to satisfy a locality property. This is a well-known restriction of FOC2 and corresponds to graded modal logic (de Rijke, 2000), which is fundamental for knowledge representation. The idea of graded modal logic is to force all subformulas to be guarded by the edge predicate E. This means that one cannot express in graded modal logic arbitrary formulas of the form ∃y ϕ(y), i.e., whether there is some node that satisfies property ϕ. Instead, one is allowed to check whether some neighbor y of the node x where the formula is being evaluated satisfies ϕ. That is, the formula ∃y (E(x, y) ∧ ϕ(y)) is expressible in the logic because ϕ(y) is guarded by E(x, y).<br />
<br />
The relationship between AC-GNNs and graded modal logic goes further: we can show that graded modal logic is the “largest” class of logical classifiers captured by AC-GNNs. This means that the only FO formulas that AC-GNNs are able to learn accurately are those in graded modal logic.<br />
<br />
According to their theorem, a logical classifier is captured by AC-GNNs if and only if it can be expressed in graded modal logic. This holds no matter which aggregate and combine operators are considered, i.e., this is a limitation of the AC-GNN architecture, not of the specific functions that one chooses to update the features.<br />
<br />
<math>\text{s.t. } \left\|\delta\right\|_{p} \leq \epsilon</math><br />
<br />
There are two ways to ensure that this dissimilarity will not happen or will be very low and the authors have shown that both of these methods are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} = \delta_{G,i} = \delta_{B,i} \; \forall i </math>, i.e., for each pixel, the perturbations of all channels are equal, so the perturbation is a single map <math> \delta_{ W \times H} </math> for an image of size <math>3 \times W \times H</math>. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not equal and it uses <math> \delta_{3 \times W \times H} </math> with the <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.<br />
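<br />
The dissimilarity cost of the 3-channel attack can be computed directly from the formula above. The following is a minimal sketch, assuming the perturbation is stored as an array of shape <math>3 \times H \times W</math> in R, G, B order; the function name <code>dissim</code> and the default <code>p=2</code> are assumptions made for illustration.<br />

```python
import numpy as np

def dissim(delta, p=2):
    """Sum of pairwise p-norm distances between the R, G, and B perturbations.

    delta: array of shape (3, H, W), channels ordered as R, G, B (assumed).
    """
    d_r, d_g, d_b = (channel.ravel() for channel in delta)
    return (
        np.linalg.norm(d_r - d_b, ord=p)
        + np.linalg.norm(d_g - d_b, ord=p)
        + np.linalg.norm(d_r - d_g, ord=p)
    )

# A 1-channel perturbation copied to all three channels has zero dissimilarity.
gray = np.full((2, 2), 0.5)
same_channels = np.stack([gray, gray, gray])

# Perturbing only the red channel makes the cost strictly positive.
red_only = np.stack([np.ones((2, 2)), np.zeros((2, 2)), np.zeros((2, 2))])
```

Calling <code>dissim(same_channels)</code> gives zero, which matches the 1-channel case, while <code>dissim(red_only)</code> is positive because the red perturbation differs from the other two channels.<br />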
<br />
== Ablation Study of the Attack parameters==<br />
To determine the required number of SGD steps, the effect of <math> \lambda_s</math>, and the importance of <math> \lambda_s</math> for each loss term in the cost function, the authors tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on Figures 4, 5, and 6, the classification loss <math>L(\delta)</math>, the total variation loss <math>TV(\delta)</math>, and the color regularizer <math>C(\delta)</math> converge to zero within 10 SGD steps. Note that since only the 1-channel attack was used in this part of the experiment, <math>Dissim(\delta)</math> was indeed zero. <br />
Figures 6 and 7 show the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss, respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors used two experiments to show that their attack on certified models can break these defenses. The datasets used for both experiments were CIFAR-10 and ImageNet.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.<br />
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also natural images (larger radii means a stronger/more confident certificate) </div><br />
<br />
The third and fifth columns correspond to the mean radius of the certified region of the original images and the mean radius of the spoof certificate of the perturbed images, respectively. The mean radius of the certificate of the adversarial images was greater than that of the original images. This indicates that the 'Shadow Attack' approach was successful in creating spoof certificates of greater radius and with the wrong label, and therefore in breaking the certified defense.<br />
<br />
=== Attack on CROWN-IBP ===<br />
CROWN-IBP is an adversarial defense against <math>\ell_\infty</math>-norm bounded attacks. The same approach was applied to the CROWN-IBP defense, and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2 :''' “Robust error” for natural images, and “attack error” for Shadow Attack images using the
CIFAR-10 dataset and CROWN-IBP models. Smaller is better. </div><br />
<br />
<br />
The above table shows the robustness errors in the case of the CROWN-IBP method and the attack images. It is seen that the errors in the case of the attack were less than the equivalent errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\text{inf}</math>-norm certified defenses as well.<br />
<br />
== Conclusion ==<br />
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques==<br />
<br />
The paper suggests that formulating defenses and certifications mathematically through an <math> l_{p} </math>-norm constraint, as assumed in the attack's optimization problem, is a weak approach. The top models cannot achieve certifications beyond a disturbance of <math> \epsilon = 0.3 </math> in the <math> l_{2} </math> norm, while disturbances of <math> \epsilon = 4 </math> added to the target input are barely noticeable to the human eye, and images perturbed with <math> \epsilon = 100 </math> are still easily classified by humans as belonging to the same class. As discussed by many authors, human perception of multi-dimensional space goes beyond what the <math> l_{p} </math> norm is capable of capturing and synthesizing. More comprehensive metrics and algorithms have yet to be proposed that capture the correlation between pixels of an image or input data and better convey to optimization algorithms how humans distinguish features of an input image. Such a metric would give the optimization algorithms better intuition about the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
[1] Xu, H., Ma, Y., Liu, H. C., Deb, D., Liu, H., Tang, J. L., & Jain, A. K. (2020). Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. International Journal of Automation and Computing, 17(2), 151–178.<br />
<br />
[2] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
[3] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
[4] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.<br />
<br />
[5] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45487THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-21T16:06:32Z<p>A227jain: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Despite the fact that GNNs have recently been proven very efficient in many applications, their theoretical properties are not yet well-understood.<br />
<br />
The ability of GNNs to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test incrementally constructs a labeling of the nodes of each graph and then decides whether two graphs are isomorphic by comparing their labelings. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each node by combining it with the aggregation of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. If the WL test assigns two nodes the same label, then no AC-GNN can distinguish them; moreover, there are AC-GNNs that can reproduce the WL labeling. This characterization, however, does not settle which Boolean node classifiers (i.e., functions classifying each node of a graph as true or false) can be expressed by GNNs: it does not imply that AC-GNNs can capture every node classifier that is refined by the WL test. The authors' work aims to determine which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
<br />
<br />
== Introduction ==<br />
They tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs, which they call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors.<br />
<br />
Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other hand, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple: there are many FOC2 node classifiers that cannot be expressed by AC-GNNs, for example the simple classifier that assigns true to every node if and only if the graph contains an isolated node. This leaves two natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that allows expressing all FOC2 classifiers? The paper answers both questions. <br />
<br />
The following are the main contributions:<br />
<br />
'''1.''' They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000), or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
<br />
'''2.''' Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts, where in each layer they also compute a feature vector for the whole graph and combine it with local aggregations; they call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that each FOC2 formula can be captured by an ACR-GNN.<br />
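<br />
To illustrate the gap between the two contributions, consider two node classifiers written as formulas. They are constructed for this summary and are not taken from the paper; the color predicates Red, Blue, and Green and the edge relation E are hypothetical names.<br />
<br />
\begin{align}<br />
\varphi(x) &= \mathrm{Red}(x) \wedge \exists^{\geq 2} y \, \big( E(x,y) \wedge \mathrm{Blue}(y) \big) \\<br />
\psi(x) &= \mathrm{Red}(x) \wedge \exists y \, \mathrm{Green}(y)<br />
\end{align}<br />
<br />
The formula <math>\varphi</math> only counts neighbors of x, so its quantifier is guarded by the edge relation and it belongs to graded modal logic; by the first contribution it can be captured by an AC-GNN. The formula <math>\psi</math> is in FOC2 but asks whether a green node exists anywhere in the graph, possibly in a different connected component, so local aggregation alone cannot decide it and a global readout is needed.<br />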
<br />
They experimentally validate their findings, showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when the networks learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
<br />
== Architecture ==<br />
The paper concentrates on the problem of Boolean node classification: given a (simple, undirected) graph <math>G = (V, E)</math> in which each vertex <math>v \in V</math> has an associated feature vector <math>x_v</math>, the goal is to classify each node as true or false. The feature vectors are assumed to be one-hot encodings of node colors, drawn from a finite set of colors. The neighborhood <math>N_G(v)</math> of a node <math>v \in V</math> is the set <math>\{u \mid \{v, u\} \in E\}</math>. The basic architecture for GNNs, and the one studied in recent work on GNN expressibility (Morris et al., 2019; Xu et al., 2019), consists of a sequence of layers that combine the feature vector of every node with the multiset of feature vectors of its neighbors. Formally, let <math>\{AGG^{(i)}\}_{i=1}^{L}</math> and <math>\{COM^{(i)}\}_{i=1}^{L}</math> be two sets of aggregation and combination functions. An aggregate-combine GNN (AC-GNN) computes vectors <math>x_v^{(i)}</math> for every node v of the graph G, via the recursive formula<br />
<br />
[[File:a227-formula.png]]<br />
<br />
<br />
where each <math>x_v^{(0)}</math> is the initial feature vector <math>x_v</math> of v. Finally, each node v of G is classified according to a Boolean classification function CLS applied to <math>x_v^{(L)}</math>.<br />
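<br />
The layer-by-layer computation can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes each layer applies COM to a node's previous vector and to AGG of its neighbors' previous vectors, and the names ac_gnn and sum_vectors, the two-color encoding, and the example classifier ("red with at least two blue neighbors") are hypothetical.<br />

```python
def ac_gnn(adjacency, features, layers, classify):
    """Apply an aggregate-combine GNN and return True/False per node.

    adjacency: dict node -> list of neighbours (simple undirected graph).
    features:  dict node -> list of floats (initial vectors x_v).
    layers:    list of (aggregate, combine) function pairs, one per layer.
    classify:  function mapping a node's final vector to True/False (CLS).
    """
    x = {v: list(vec) for v, vec in features.items()}
    for aggregate, combine in layers:
        # All nodes read the previous layer's vectors, so build a new dict.
        x = {
            v: combine(x[v], aggregate([x[u] for u in adjacency[v]]))
            for v in adjacency
        }
    return {v: classify(x[v]) for v in adjacency}


def sum_vectors(vectors, dim):
    """Component-wise sum of a multiset of vectors (zeros if it is empty)."""
    total = [0.0] * dim
    for vec in vectors:
        total = [t + a for t, a in zip(total, vec)]
    return total


# One-hot colours: index 0 = red, index 1 = blue (assumed encoding).
RED, BLUE = [1.0, 0.0], [0.0, 1.0]

# A single layer: AGG sums neighbour vectors, COM keeps
# (is the node red?, number of blue neighbours).
layer = (
    lambda neighbours: sum_vectors(neighbours, 2),
    lambda own, agg: [own[0], agg[1]],
)

graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": []}
colours = {"a": RED, "b": BLUE, "c": BLUE, "d": RED}

# CLS: true for red nodes with at least two blue neighbours.
result = ac_gnn(
    graph, colours, [layer], lambda vec: vec[0] == 1.0 and vec[1] >= 2.0
)
```

Only node a is classified as true. The isolated node d sees an empty multiset of neighbors, which shows why a purely local architecture has no access to information elsewhere in the graph.<br />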
<br />
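A standard norm-bounded adversarial attack solves the following constrained optimization problem:<br />
<br />
\begin{align}<br />
\max_{\delta }L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
\text{s.t.} \ \left \|\delta \right \|_{p} \leq \epsilon <br />
\end{align}<br />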
<br />
The Shadow Attack, on the other hand, targets the certificate of the defense by creating a new 'spoofed' certificate outside the certified region of the input image. It solves the following optimization problem, where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are regularizers.<br />
<br />
\begin{align}<br />
\max_{\delta} L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> is the color regularizer, which keeps changes to the color of the input image minimal. <math>TV</math> is the Total Variation term, which keeps the perturbed image smooth. <math>Dissim</math> is the dissimilarity term, which penalizes perturbations that differ across the color channels (RGB), so that all channels are changed by similar amounts.<br />
<br />
The resulting perturbations of the original images are:<br />
<br />
'''1.''' small<br />
<br />
'''2.''' smooth<br />
<br />
'''3.''' without dramatic color changes<br />
<br />
There are two ways to keep the dissimilarity between channels at zero or very low, and the authors show that both are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} = \delta_{G,i} = \delta_{B,i} \forall i </math>, i.e., for each pixel the perturbations of all channels are equal, so the perturbation is a single <math> \delta_{ W \times H} </math> shared across channels, where the size of the image is <math>3 \times W \times H</math>. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not equal and it uses <math> \delta_{3 \times W \times H} </math> with the <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.<br />
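<br />
The dissimilarity cost of the 3-channel attack can be computed directly from its definition above. The sketch below is illustrative only: the function names, the flattened list representation of each channel, and the default p = 2 are assumptions, since the summary does not fix a value for p.<br />

```python
def p_norm(values, p):
    """l_p norm of a flat list of numbers."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def dissim(delta_r, delta_g, delta_b, p=2):
    """Dissim(delta) for per-channel perturbations given as flat lists.

    Each argument holds the W*H perturbation values of one colour channel.
    The default p=2 is an assumption made for this example.
    """
    def diff(a, b):
        return [x - y for x, y in zip(a, b)]

    return (
        p_norm(diff(delta_r, delta_b), p)
        + p_norm(diff(delta_g, delta_b), p)
        + p_norm(diff(delta_r, delta_g), p)
    )


# 1-channel attack: the same perturbation is shared by R, G and B.
shared = [0.1, -0.2, 0.3]
one_channel_cost = dissim(shared, shared, shared)

# 3-channel attack: only the red channel is perturbed.
three_channel_cost = dissim([3.0, 4.0], [0.0, 0.0], [0.0, 0.0])
```

With a shared perturbation every pairwise difference vanishes, so the cost is zero, which matches the 1-channel case. In the second call the red channel differs from both other channels by a vector of norm 5, giving a cost of 10.<br />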
<br />
== Ablation Study of the Attack parameters==<br />
To determine the required number of SGD steps and the effect of <math> \lambda_s</math> and <math>\lambda_{tv}</math> on each loss term in the cost function, the authors tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figures 4, 5, and 6, <math>L(\delta)</math> (classification loss), <math>TV(\delta)</math> (Total Variation loss), and <math>C(\delta)</math> (color regularizer) converge to zero with 10 SGD steps. Note that since only the 1-channel attack was used in this part of the experiment, <math>Dissim(\delta)</math> was zero by construction. <br />
Figures 6 and 7 show the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss, respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors ran two experiments to show that their attack can break certified defenses. Both experiments used the CIFAR-10 and ImageNet datasets.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.<br />
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also natural images (larger radii means a stronger/more confident certificate) </div><br />
<br />
The third and fifth columns correspond to the mean radius of the certified region of the original images and the mean radius of the spoofed certificate of the perturbed images, respectively. The mean certificate radius of the adversarial images was greater than that of the original images. This indicates that the 'Shadow Attack' succeeded in creating spoofed certificates with a larger radius and the wrong label, and therefore in breaking this certified defense.<br />
<br />
=== Attack on CROWN-IBP ===<br />
CROWN-IBP is an adversarial defense against <math>l_\infty</math>-norm bounded attacks. The same approach was applied to the CROWN-IBP defense, and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2 :''' “Robust error” for natural images, and “attack error” for Shadow Attack images using the<br />
CIFAR-10 dataset, and CROWN-IBP models. Smaller is better. </div><br />
<br />
<br />
The above table shows the robust errors of the CROWN-IBP method on natural images and the attack errors on Shadow Attack images. The attack errors were lower than the corresponding robust errors, which suggests that the authors' 'Shadow Attack' was successful in breaking <math>l_\infty</math>-norm certified defenses as well.<br />
<br />
== Conclusion ==<br />
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques==<br />
<br />
A notable point of the paper is that defenses and certificates formulated purely in terms of an <math> l_{p} </math>-norm constraint, as in equation \eqref{eq:op}, offer only weak guarantees. The top models cannot achieve certification beyond an <math> \epsilon = 0.3 </math> disturbance in <math> l_{2} </math> norm, while disturbances of <math> \epsilon = 4 </math> added to the target input are barely noticeable to human eyes, and images perturbed with <math> \epsilon = 100 </math> are still easily classified by humans as belonging to the same class. As discussed by many authors, human perception of images goes beyond what the <math> l_{p} </math> norm can capture. More comprehensive metrics and algorithms are yet to be proposed that capture the correlation between pixels of an image or input data and thereby better convey to optimization algorithms how humans distinguish the features of an input image. Such a metric would let optimization algorithms better account for the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
[1] Xu, H., Ma, Y., Liu, H. C., Deb, D., Liu, H., Tang, J. L., & Jain, A. K. (2020). Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. International Journal of Automation and Computing, 17(2), 151–178.<br />
<br />
[2] Christian Szegedy,Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
[3] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
[4] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.<br />
<br />
[5] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:a227-formula.png&diff=45486File:a227-formula.png2020-11-21T15:59:41Z<p>A227jain: a227-formula.png</p>
<hr />
<div>a227-formula.png</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45485THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-21T15:57:52Z<p>A227jain: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Despite the fact that GNNs have recently been proven very efficient in many applications, their theoretical properties are not yet well-understood.<br />
<br />
The ability of GNNs to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test incrementally constructs a labeling of the nodes of a graph and then decides whether two graphs are isomorphic by comparing their labelings. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each node by combining it with the aggregation of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. If the WL test assigns the same label to two nodes of a graph, then every AC-GNN classifies both nodes the same way; moreover, there are AC-GNNs that can reproduce the WL labeling. This characterization does not settle, however, which Boolean node classifiers (i.e., functions assigning true or false to every node of a graph) can be expressed by GNNs: it does not imply that AC-GNNs can capture every node classifier that is refined by the WL test. The authors' work aims to determine which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
<br />
<br />
== Introduction ==<br />
They tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs, which they call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors.<br />
<br />
Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other hand, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple: there are many FOC2 node classifiers that cannot be expressed by AC-GNNs, for example the simple classifier that assigns true to every node if and only if the graph contains an isolated node. This leaves two natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that allows expressing all FOC2 classifiers? The paper answers both questions. <br />
<br />
The following are the main contributions:<br />
<br />
'''1.''' They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000), or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
<br />
'''2.''' Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts, where in each layer they also compute a feature vector for the whole graph and combine it with local aggregations; they call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that each FOC2 formula can be captured by an ACR-GNN.<br />
<br />
They experimentally validate their findings, showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when the networks learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
<br />
== Architecture ==<br />
The paper concentrates on the problem of Boolean node classification: given a (simple, undirected) graph <math>G = (V, E)</math> in which each vertex <math>v \in V</math> has an associated feature vector <math>x_v</math>, the goal is to classify each node as true or false. The feature vectors are assumed to be one-hot encodings of node colors, drawn from a finite set of colors. The neighborhood <math>N_G(v)</math> of a node <math>v \in V</math> is the set <math>\{u \mid \{v, u\} \in E\}</math>. The basic architecture for GNNs, and the one studied in recent work on GNN expressibility (Morris et al., 2019; Xu et al., 2019), consists of a sequence of layers that combine the feature vector of every node with the multiset of feature vectors of its neighbors. Formally, let <math>\{AGG^{(i)}\}_{i=1}^{L}</math> and <math>\{COM^{(i)}\}_{i=1}^{L}</math> be two sets of aggregation and combination functions. An aggregate-combine GNN (AC-GNN) computes vectors <math>x_v^{(i)}</math> for every node v of the graph G, via the recursive formula<br />
<br />
[[File:formula.png]]<br />
<br />
<br />
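A standard norm-bounded adversarial attack solves the following constrained optimization problem:<br />
<br />
\begin{align}<br />
\max_{\delta }L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
\text{s.t.} \ \left \|\delta \right \|_{p} \leq \epsilon <br />
\end{align}<br />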
<br />
The Shadow Attack, on the other hand, targets the certificate of the defense by creating a new 'spoofed' certificate outside the certified region of the input image. It solves the following optimization problem, where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are regularizers.<br />
<br />
\begin{align}<br />
\max_{\delta} L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> is the color regularizer, which keeps changes to the color of the input image minimal. <math>TV</math> is the Total Variation term, which keeps the perturbed image smooth. <math>Dissim</math> is the dissimilarity term, which penalizes perturbations that differ across the color channels (RGB), so that all channels are changed by similar amounts.<br />
<br />
The resulting perturbations of the original images are:<br />
<br />
'''1.''' small<br />
<br />
'''2.''' smooth<br />
<br />
'''3.''' without dramatic color changes<br />
<br />
There are two ways to keep the dissimilarity between channels at zero or very low, and the authors show that both are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} = \delta_{G,i} = \delta_{B,i} \forall i </math>, i.e., for each pixel the perturbations of all channels are equal, so the perturbation is a single <math> \delta_{ W \times H} </math> shared across channels, where the size of the image is <math>3 \times W \times H</math>. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not equal and it uses <math> \delta_{3 \times W \times H} </math> with the <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.<br />
<br />
== Ablation Study of the Attack parameters==<br />
To determine the required number of SGD steps and the effect of <math> \lambda_s</math> and <math>\lambda_{tv}</math> on each loss term in the cost function, the authors tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figures 4, 5, and 6, <math>L(\delta)</math> (classification loss), <math>TV(\delta)</math> (Total Variation loss), and <math>C(\delta)</math> (color regularizer) converge to zero with 10 SGD steps. Note that since only the 1-channel attack was used in this part of the experiment, <math>Dissim(\delta)</math> was zero by construction. <br />
Figures 6 and 7 show the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss, respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors ran two experiments to show that their attack can break certified defenses. Both experiments used the CIFAR-10 and ImageNet datasets.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.<br />
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also natural images (larger radii means a stronger/more confident certificate) </div><br />
<br />
The third and fifth columns correspond to the mean radius of the certified region of the original images and the mean radius of the spoofed certificate of the perturbed images, respectively. The mean certificate radius of the adversarial images was greater than that of the original images. This indicates that the 'Shadow Attack' succeeded in creating spoofed certificates with a larger radius and the wrong label, and therefore in breaking this certified defense.<br />
<br />
=== Attack on CROWN-IBP ===<br />
CROWN-IBP is an adversarial defense against <math>l_\infty</math>-norm bounded attacks. The same approach was applied to the CROWN-IBP defense, and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2 :''' “Robust error” for natural images, and “attack error” for Shadow Attack images using the<br />
CIFAR-10 dataset, and CROWN-IBP models. Smaller is better. </div><br />
<br />
<br />
The above table shows the robust errors of the CROWN-IBP method on natural images and the attack errors on Shadow Attack images. The attack errors were lower than the corresponding robust errors, which suggests that the authors' 'Shadow Attack' was successful in breaking <math>l_\infty</math>-norm certified defenses as well.<br />
<br />
== Conclusion ==<br />
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques==<br />
<br />
A notable point of the paper is that defenses and certificates formulated purely in terms of an <math> l_{p} </math>-norm constraint, as in equation \eqref{eq:op}, offer only weak guarantees. The top models cannot achieve certification beyond an <math> \epsilon = 0.3 </math> disturbance in <math> l_{2} </math> norm, while disturbances of <math> \epsilon = 4 </math> added to the target input are barely noticeable to human eyes, and images perturbed with <math> \epsilon = 100 </math> are still easily classified by humans as belonging to the same class. As discussed by many authors, human perception of images goes beyond what the <math> l_{p} </math> norm can capture. More comprehensive metrics and algorithms are yet to be proposed that capture the correlation between pixels of an image or input data and thereby better convey to optimization algorithms how humans distinguish the features of an input image. Such a metric would let optimization algorithms better account for the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
[1] Xu, H., Ma, Y., Liu, H. C., Deb, D., Liu, H., Tang, J. L., & Jain, A. K. (2020). Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. International Journal of Automation and Computing, 17(2), 151–178.<br />
<br />
[2] Christian Szegedy,Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
[3] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
[4] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.<br />
<br />
[5] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45484THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-21T15:57:00Z<p>A227jain: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Despite the fact that GNNs have recently been proven very efficient in many applications, their theoretical properties are not yet well-understood.<br />
<br />
The ability of GNNs to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test incrementally constructs a labeling of the nodes of a graph and then decides whether two graphs are isomorphic by comparing their labelings. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each node by combining it with the aggregation of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. If the WL test assigns the same label to two nodes of a graph, then every AC-GNN classifies both nodes the same way; moreover, there are AC-GNNs that can reproduce the WL labeling. This characterization does not settle, however, which Boolean node classifiers (i.e., functions assigning true or false to every node of a graph) can be expressed by GNNs: it does not imply that AC-GNNs can capture every node classifier that is refined by the WL test. The authors' work aims to determine which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
<br />
<br />
== Introduction ==<br />
They tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs, which they call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors.<br />
<br />
Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other hand, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple: there are many FOC2 node classifiers that cannot be expressed by AC-GNNs, for example the simple classifier that assigns true to every node if and only if the graph contains an isolated node. This leaves two natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that allows expressing all FOC2 classifiers? The paper answers both questions. <br />
<br />
The following are the main contributions:<br />
<br />
'''1.''' They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000), or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
<br />
'''2.''' Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts, where in each layer they also compute a feature vector for the whole graph and combine it with local aggregations; they call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that each FOC2 formula can be captured by an ACR-GNN.<br />
<br />
They experimentally validate their findings, showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when the networks learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
<br />
== Architecture ==<br />
They concentrate on the problem of Boolean node classification: given a (simple, undirected) graph <math>G = (V, E)</math> in which each vertex <math>v \in V</math> has an associated feature vector <math>x_v</math>, the goal is to classify each graph node as true or false. The paper assumes that these feature vectors are one-hot encodings of node colors in the graph, from a finite set of colors. The neighborhood <math>N_G(v)</math> of a node <math>v \in V</math> is the set <math>\{u \mid \{v, u\} \in E\}</math>. The basic architecture for GNNs, and the one studied in recent work on GNN expressibility (Morris et al., 2019; Xu et al., 2019), consists of a sequence of layers that combine the feature vector of every node with the multiset of feature vectors of its neighbors. Formally, let AGG and COM be two sets of aggregation and combination functions. An aggregate-combine GNN (AC-GNN) computes vectors <math>x_v^{(i)}</math> for every node v of the graph G, via the recursive formula<br />
<br />
[[File:formula.jpg]]<br />
<br />
<br />
More precisely, the two sets are <math>\{AGG^{(i)}\}_{i=1}^{L}</math> and <math>\{COM^{(i)}\}_{i=1}^{L}</math>, with one aggregation and one combination function for each <math>i = 1, \dots, L</math>, and <math>x_v^{(i)}</math> is the vector computed for node v of the graph G at step i of the recursion.<br />
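The aggregate-combine recursion can be sketched in a few lines of NumPy. This is only an illustration, not the authors' implementation. It assumes the recursion has the form <math>x_v^{(i)} = COM^{(i)}\left(x_v^{(i-1)}, AGG^{(i)}\left(\{\!\{x_u^{(i-1)} \mid u \in N_G(v)\}\!\}\right)\right)</math>, takes AGG to be a sum over neighbors, and takes COM to be an affine map followed by a truncated ReLU. The names <code>ac_gnn_layer</code>, <code>run_ac_gnn</code>, <code>W_self</code>, <code>W_neigh</code> and <code>b</code> are hypothetical.

```python
import numpy as np

def ac_gnn_layer(x, adj, W_self, W_neigh, b):
    """One aggregate-combine layer (illustrative choice of AGG and COM).

    x:   (n, d_in) matrix, row v holds the current feature vector of node v
    adj: (n, n) symmetric 0/1 adjacency matrix of a simple undirected graph
    """
    agg = adj @ x                              # AGG: sum of neighbor features
    pre = x @ W_self + agg @ W_neigh + b       # COM: affine map of (own, aggregated)
    return np.clip(pre, 0.0, 1.0)              # truncated ReLU: min(max(0, z), 1)

def run_ac_gnn(x, adj, layers):
    """Apply a list of (W_self, W_neigh, b) layers in sequence."""
    for W_self, W_neigh, b in layers:
        x = ac_gnn_layer(x, adj, W_self, W_neigh, b)
    return x

# Path graph 0 - 1 - 2 with one-hot colors [red, blue]: nodes 0 and 2 are red.
x0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)

# Single layer computing "the node has at least two red neighbors".
layer = (np.zeros((2, 1)), np.array([[1.0], [0.0]]), np.array([-1.0]))
out = run_ac_gnn(x0, adj, [layer])
```

In this toy graph only the middle node has two red neighbors, so it is the only node whose output is 1. The counting threshold is encoded in the bias, which mirrors the way graded modal logic counts neighbors that satisfy a property.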
<br />
\begin{align}<br />
\max_{\delta }L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
\text{s.t. } \left \|\delta \right \|_{p} \leq \epsilon <br />
\end{align}<br />
<br />
The Shadow Attack, on the other hand, targets the certificate of the defense by creating a new 'spoofed' certificate outside the certified region of the input image. It solves the following optimization problem, where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.<br />
<br />
\begin{align}<br />
\max_{\delta} L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> is the color regularizer, which ensures that only minimal changes are made to the color of the input image. <math>TV</math> is the Total Variation (smoothness) regularizer, which ensures that the newly created image stays smooth. <math>Dissim</math> is the dissimilarity regularizer, which ensures that all the color channels (RGB) are changed equally.<br />
<br />
The perturbations created in the original images are:<br />
<br />
'''1.''' small<br />
<br />
'''2.''' smooth<br />
<br />
'''3.''' without dramatic color changes<br />
<br />
There are two ways to keep the dissimilarity between color channels at zero or very low, and the authors show that both are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} \approx \delta_{G,i} \approx \delta_{B,i} \forall i </math>, i.e., for each pixel, the perturbations of all channels are equal. For an image of size <math>3 \times W \times H</math>, the perturbation is a single array <math> \delta_{ W \times H} </math> shared by all three channels. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in the different channels of a pixel need not be equal. It uses <math> \delta_{3 \times W \times H} </math> with <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.<br />
<br />
== Ablation Study of the Attack Parameters ==<br />
In order to determine the required number of SGD steps and the effect of <math> \lambda_s</math> and <math> \lambda_{tv}</math> on each of the losses in the cost function, the authors tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on Figures 4, 5, and 6, we can see that <math>L(\delta)</math> (classification loss), <math>TV(\delta)</math> (Total Variation loss), and <math>C(\delta)</math> (color regularizer) converge to zero within 10 SGD steps. Note that since only the 1-channel attack was used in this part of the experiment, <math>Dissim(\delta)</math> was zero by construction. <br />
Figures 6 and 7 show the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss, respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors ran two experiments to show that their approach to attacking a certified model was actually able to break those defenses. The datasets used for both experiments were CIFAR-10 and ImageNet.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.<br />
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also natural images (larger radii means a stronger/more confident certificate) </div><br />
<br />
The third and fifth columns correspond to the mean radius of the certified region of the original images and the mean radius of the spoofed certificate of the perturbed images, respectively. The mean certificate radius of the adversarial images was greater than that of the original images. This indicates that the 'Shadow Attack' approach was successful in creating spoofed certificates with a larger radius and the wrong label, and hence in breaking the certified defense.<br />
<br />
=== Attack on CROWN-IBP ===<br />
CROWN-IBP is an adversarial defense against <math>l_\infty</math>-norm bounded attacks. The same approach was applied to the CROWN-IBP defense, and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2:''' “Robust error” for natural images, and “attack error” for Shadow Attack images, using the CIFAR-10 dataset and CROWN-IBP models. Smaller is better. </div><br />
<br />
<br />
The above table shows the robust errors of the CROWN-IBP method on natural images and the attack errors on Shadow Attack images. The errors under attack were lower than the corresponding robust errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\infty</math>-norm certified defenses as well.<br />
<br />
== Conclusion ==<br />
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques ==<br />
<br />
A notable point of this paper is that the mathematical formulation of the defenses and certifications is itself a weakness, since the constraint is imposed through the <math> l_{p} </math> norm, as in equation \eqref{eq:op}. The top models cannot achieve certifications beyond an <math> \epsilon = 0.3 </math> disturbance in the <math> l_{2} </math> norm, while disturbances of <math> \epsilon = 4 </math> added to the target input are barely noticeable by human eyes, and images with disturbances of <math> \epsilon = 100 </math> are still easily classified by humans as belonging to the same class. As discussed by many authors, human perception of multi-dimensional space goes beyond what the <math> l_{p} </math> norm is capable of capturing. More comprehensive metrics and algorithms, capable of capturing the correlation between the pixels of an image or input data, have yet to be proposed. Such metrics could better convey to optimization algorithms how humans distinguish the features of an input image, and would give those algorithms better intuition about the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
[1] Xu, H., Ma, Y., Liu, H. C., Deb, D., Liu, H., Tang, J. L., & Jain, A. K. (2020). Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. International Journal of Automation and Computing, 17(2), 151–178.<br />
<br />
[2] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
[3] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
[4] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.<br />
<br />
[5] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45409THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-20T12:58:42Z<p>A227jain: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Despite the fact that GNNs have recently been proven very efficient in many applications, their theoretical properties are not yet well-understood.<br />
<br />
The ability of graph neural networks (GNNs) to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test works by constructing a labeling of the nodes of the graph in an incremental fashion, and then decides whether two graphs are isomorphic by comparing the labelings of each graph. This characterization, however, does not settle the issue of which Boolean node classifiers (i.e., functions classifying nodes in graphs as true or false) can be expressed by GNNs. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each graph node by combining it with the aggregation of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. The WL test bounds the discriminative power of AC-GNNs: if two nodes receive the same WL label, every AC-GNN classifies them identically (Morris et al., 2019; Xu et al., 2019). Moreover, there are AC-GNNs that can reproduce the WL labeling. This does not imply, however, that AC-GNNs can capture every node classifier (that is, every function assigning true or false to every node) that is refined by the WL test. The authors' work aims to answer the question of which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
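The incremental labeling performed by the WL test can be illustrated with a short sketch of color refinement on a single graph. This is a minimal version written for illustration, assuming integer initial colors and an adjacency-list representation. The function name <code>wl_refine</code> and its arguments are hypothetical and do not come from the paper.

```python
def wl_refine(adj_list, colors, rounds):
    """Color refinement on one graph.

    adj_list: dict mapping each node to a list of its neighbors
    colors:   dict mapping each node to an integer initial color
    rounds:   number of refinement rounds to run
    """
    colors = dict(colors)
    for _ in range(rounds):
        # A node's signature is its own color plus the sorted multiset
        # of the colors of its neighbors.
        sigs = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj_list[v])))
            for v in adj_list
        }
        # Relabel distinct signatures with small integers.
        palette = {s: k for k, s in enumerate(sorted(set(sigs.values())))}
        colors = {v: palette[sigs[v]] for v in adj_list}
    return colors

# Path graph 0 - 1 - 2 - 3, all nodes starting with the same color.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = wl_refine(path, {v: 0 for v in path}, rounds=2)
```

After refinement, the two endpoints share one label and the two inner nodes share another, because the labeling only looks at the multiset of neighbor labels. This is the same kind of information that an aggregate-combine layer has access to.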
<br />
[[File:certified_defense.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Certified Defense Illustration </div><br />
<br />
== Introduction ==<br />
They tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs, which they call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors.<br />
<br />
Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other hand, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple, and there are many FOC2 node classifiers (e.g., the trivial one above) that cannot be expressed by AC-GNNs. This leaves us with the following natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that allows expressing all FOC2 classifiers? In this paper, they provide answers to these two questions. <br />
<br />
The following are the main contributions:<br />
<br />
'''1.''' They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000), or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
<br />
'''2.''' Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts: in each layer, they also compute a feature vector for the whole graph and combine it with the local aggregations. They call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that each FOC2 formula can be captured by an ACR-GNN.<br />
<br />
They experimentally validate their findings, showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when the networks learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data, while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
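The effect of a global readout can be seen in a small NumPy sketch. It is an illustration under assumptions rather than the authors' architecture: aggregation and readout are both taken to be sums, the combination is an affine map followed by a truncated ReLU, and the names <code>acr_gnn_layer</code>, <code>W_self</code>, <code>W_neigh</code>, <code>W_read</code> and <code>b</code> are hypothetical.

```python
import numpy as np

def acr_gnn_layer(x, adj, W_self, W_neigh, W_read, b):
    """One aggregate-combine-readout layer (illustrative sum-based version)."""
    agg = adj @ x                            # local aggregation over neighbors
    read = x.sum(axis=0, keepdims=True)      # global readout over all nodes, shape (1, d_in)
    pre = x @ W_self + agg @ W_neigh + read @ W_read + b
    return np.clip(pre, 0.0, 1.0)            # truncated ReLU

# Three nodes: an edge between 0 and 1, node 2 is isolated.
adj = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)

# One-hot colors [red, blue].
x_with_red = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])   # node 0 is red
x_all_blue = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])

# Layer computing "some node in the graph is red" using only the readout.
zeros = np.zeros((2, 1))
W_read = np.array([[1.0], [0.0]])
b = np.array([0.0])

out_red = acr_gnn_layer(x_with_red, adj, zeros, zeros, W_read, b)
out_blue = acr_gnn_layer(x_all_blue, adj, zeros, zeros, W_read, b)
```

The isolated node 2 has the same color and the same (empty) neighborhood in both inputs, yet its output differs because the readout sees node 0. A layer that only aggregates over neighbors cannot produce this behavior, which is the kind of gap between AC-GNNs and FOC2 that the readout is meant to close.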
<br />
== Approach ==<br />
The approach used by the authors in this paper is the 'Shadow Attack', which is a generalization of the well-known Projected Gradient Descent (PGD) attack. PGD, a universal first-order adversary [4], is the only method to greatly improve NN model robustness among all the defenses appearing in ICLR2018 and CVPR2018 [5]. The fundamental idea of the PGD attack is the same: adversarial images are created in order to fool the network into making a wrong prediction. The PGD attack solves the following optimization problem, where <math>L</math> is the classification loss and the constraint bounds the change made to the input image. For a recent review on adversarial attacks and more information on PGD attacks, see [1].<br />
<br />
\begin{align}<br />
\max_{\delta }L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
\text{s.t. } \left \|\delta \right \|_{p} \leq \epsilon <br />
\end{align}<br />
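The constrained maximization above can be sketched with a toy example. This is a minimal illustration and not the setup used in the paper: the classifier is assumed to be a linear logistic model so that the gradient is available in closed form, the norm is taken to be <math>l_\infty</math>, and the names <code>loss_and_grad</code>, <code>pgd_linf</code>, <code>step</code> and <code>n_steps</code> are hypothetical.

```python
import numpy as np

def loss_and_grad(w, b, x, y):
    """Logistic loss of a linear model on one example (y in {0, 1})
    and the gradient of that loss with respect to the input x."""
    z = float(w @ x + b)
    p = 1.0 / (1.0 + np.exp(-z))
    loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, (p - y) * w

def pgd_linf(w, b, x, y, eps, step, n_steps):
    """Projected gradient ascent on the loss inside an l_inf ball of radius eps."""
    delta = np.zeros_like(x)
    for _ in range(n_steps):
        _, grad = loss_and_grad(w, b, x + delta, y)
        delta = delta + step * np.sign(grad)    # ascent step on the loss
        delta = np.clip(delta, -eps, eps)       # projection onto the l_inf ball
    return delta

w = np.array([1.0, -2.0, 0.5])
b = 0.0
x = np.array([0.2, -0.1, 0.4])
y = 1
eps = 0.1

delta = pgd_linf(w, b, x, y, eps=eps, step=0.05, n_steps=5)
clean_loss, _ = loss_and_grad(w, b, x, y)
adv_loss, _ = loss_and_grad(w, b, x + delta, y)
```

The projection step is what keeps the perturbation inside the constraint set. The Shadow Attack drops this hard constraint and replaces it with penalty terms in the objective.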
<br />
The Shadow Attack, on the other hand, targets the certificate of the defense by creating a new 'spoofed' certificate outside the certified region of the input image. It solves the following optimization problem, where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.<br />
<br />
\begin{align}<br />
\max_{\delta} L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> is the color regularizer, which ensures that only minimal changes are made to the color of the input image. <math>TV</math> is the Total Variation (smoothness) regularizer, which ensures that the newly created image stays smooth. <math>Dissim</math> is the dissimilarity regularizer, which ensures that all the color channels (RGB) are changed equally.<br />
<br />
The perturbations created in the original images are:<br />
<br />
'''1.''' small<br />
<br />
'''2.''' smooth<br />
<br />
'''3.''' without dramatic color changes<br />
<br />
There are two ways to keep the dissimilarity between color channels at zero or very low, and the authors show that both are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} \approx \delta_{G,i} \approx \delta_{B,i} \forall i </math>, i.e., for each pixel, the perturbations of all channels are equal. For an image of size <math>3 \times W \times H</math>, the perturbation is a single array <math> \delta_{ W \times H} </math> shared by all three channels. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in the different channels of a pixel need not be equal. It uses <math> \delta_{3 \times W \times H} </math> with <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.<br />
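The penalty terms of the Shadow Attack objective can be sketched as follows. Only <code>dissim</code> follows a formula given in this summary. The forms used for <code>total_variation</code> (anisotropic sum of absolute neighbor differences) and <code>color_reg</code> (squared <math>l_2</math> norm of the per-channel mean absolute perturbation) are assumptions chosen for illustration, and all function names are hypothetical.

```python
import numpy as np

def dissim(delta, p=2):
    """Pairwise p-norm distance between the R, G, B perturbation channels.
    delta has shape (3, H, W)."""
    r, g, b = (c.ravel() for c in delta)
    return float(np.linalg.norm(r - b, ord=p)
                 + np.linalg.norm(g - b, ord=p)
                 + np.linalg.norm(r - g, ord=p))

def total_variation(delta):
    """Assumed form: sum of absolute differences between adjacent pixels."""
    dh = np.abs(np.diff(delta, axis=1)).sum()
    dw = np.abs(np.diff(delta, axis=2)).sum()
    return float(dh + dw)

def color_reg(delta):
    """Assumed form: squared l2 norm of the per-channel mean absolute perturbation."""
    return float(np.sum(np.mean(np.abs(delta), axis=(1, 2)) ** 2))

def shadow_objective(loss, delta, lam_c, lam_tv, lam_s):
    """Classification loss minus the three weighted penalties, as in equation (2)."""
    return (loss
            - lam_c * color_reg(delta)
            - lam_tv * total_variation(delta)
            - lam_s * dissim(delta))

# 1-channel attack: one (H, W) array shared by all three channels.
base = np.arange(12.0).reshape(3, 4) / 10.0
delta_1ch = np.stack([base, base, base])

# A constant perturbation is perfectly smooth.
delta_const = np.full((3, 4, 4), 0.5)
objective = shadow_objective(1.0, delta_const, lam_c=1.0, lam_tv=1.0, lam_s=1.0)
```

With a shared array the dissimilarity term vanishes, which matches the 1-channel case described above. In the 3-channel case the term is positive whenever the channels differ, and <math>\lambda_s</math> controls how strongly that is penalized.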
<br />
== Ablation Study of the Attack Parameters ==<br />
In order to determine the required number of SGD steps and the effect of <math> \lambda_s</math> and <math> \lambda_{tv}</math> on each of the losses in the cost function, the authors tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on Figures 4, 5, and 6, we can see that <math>L(\delta)</math> (classification loss), <math>TV(\delta)</math> (Total Variation loss), and <math>C(\delta)</math> (color regularizer) converge to zero within 10 SGD steps. Note that since only the 1-channel attack was used in this part of the experiment, <math>Dissim(\delta)</math> was zero by construction. <br />
Figures 6 and 7 show the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss, respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors ran two experiments to show that their approach to attacking a certified model was actually able to break those defenses. The datasets used for both experiments were CIFAR-10 and ImageNet.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.<br />
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also natural images (larger radii means a stronger/more confident certificate) </div><br />
<br />
The third and fifth columns correspond to the mean radius of the certified region of the original images and the mean radius of the spoofed certificate of the perturbed images, respectively. The mean certificate radius of the adversarial images was greater than that of the original images. This indicates that the 'Shadow Attack' approach was successful in creating spoofed certificates with a larger radius and the wrong label, and hence in breaking the certified defense.<br />
<br />
=== Attack on CROWN-IBP ===<br />
CROWN-IBP is an adversarial defense against <math>l_\infty</math>-norm bounded attacks. The same approach was applied to the CROWN-IBP defense, and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2:''' “Robust error” for natural images, and “attack error” for Shadow Attack images, using the CIFAR-10 dataset and CROWN-IBP models. Smaller is better. </div><br />
<br />
<br />
The above table shows the robust errors of the CROWN-IBP method on natural images and the attack errors on Shadow Attack images. The errors under attack were lower than the corresponding robust errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\infty</math>-norm certified defenses as well.<br />
<br />
== Conclusion ==<br />
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques ==<br />
<br />
A notable point of this paper is that the mathematical formulation of the defenses and certifications is itself a weakness, since the constraint is imposed through the <math> l_{p} </math> norm, as in equation \eqref{eq:op}. The top models cannot achieve certifications beyond an <math> \epsilon = 0.3 </math> disturbance in the <math> l_{2} </math> norm, while disturbances of <math> \epsilon = 4 </math> added to the target input are barely noticeable by human eyes, and images with disturbances of <math> \epsilon = 100 </math> are still easily classified by humans as belonging to the same class. As discussed by many authors, human perception of multi-dimensional space goes beyond what the <math> l_{p} </math> norm is capable of capturing. More comprehensive metrics and algorithms, capable of capturing the correlation between the pixels of an image or input data, have yet to be proposed. Such metrics could better convey to optimization algorithms how humans distinguish the features of an input image, and would give those algorithms better intuition about the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
[1] Xu, H., Ma, Y., Liu, H. C., Deb, D., Liu, H., Tang, J. L., & Jain, A. K. (2020). Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. International Journal of Automation and Computing, 17(2), 151–178.<br />
<br />
[2] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
[3] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
[4] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.<br />
<br />
[5] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45408THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-20T12:34:04Z<p>A227jain: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Despite the fact that GNNs have recently been proven very efficient in many applications, their theoretical properties are not yet well-understood.<br />
<br />
The ability of graph neural networks (GNNs) to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test works by constructing a labeling of the nodes of the graph in an incremental fashion, and then decides whether two graphs are isomorphic by comparing the labelings of each graph. This characterization, however, does not settle the issue of which Boolean node classifiers (i.e., functions classifying nodes in graphs as true or false) can be expressed by GNNs. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each graph node by combining it with the aggregation of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. The WL test bounds the discriminative power of AC-GNNs: if two nodes receive the same WL label, every AC-GNN classifies them identically (Morris et al., 2019; Xu et al., 2019). Moreover, there are AC-GNNs that can reproduce the WL labeling. This does not imply, however, that AC-GNNs can capture every node classifier (that is, every function assigning true or false to every node) that is refined by the WL test. The authors' work aims to answer the question of which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
<br />
[[File:certified_defense.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Certified Defense Illustration </div><br />
<br />
== Introduction ==<br />
They tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs, which they call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors.<br />
<br />
Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other hand, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple, and there are many FOC2 node classifiers (e.g., the trivial one above) that cannot be expressed by AC-GNNs. This leaves us with the following natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that allows expressing all FOC2 classifiers? In this paper, they provide answers to these two questions. The following are the main contributions:<br />
* They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000), or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
* Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts: in each layer, they also compute a feature vector for the whole graph and combine it with the local aggregations. They call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that each FOC2 formula can be captured by an ACR-GNN.<br />
<br />
== Approach ==<br />
The approach used by the authors in this paper is the 'Shadow Attack', which is a generalization of the well-known Projected Gradient Descent (PGD) attack. PGD, a universal first-order adversary [4], is the only method to greatly improve NN model robustness among all the defenses appearing in ICLR2018 and CVPR2018 [5]. The fundamental idea of the PGD attack is the same: adversarial images are created in order to fool the network into making a wrong prediction. The PGD attack solves the following optimization problem, where <math>L</math> is the classification loss and the constraint bounds the change made to the input image. For a recent review on adversarial attacks and more information on PGD attacks, see [1].<br />
<br />
\begin{align}<br />
\max_{\delta }L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
\text{s.t. } \left \|\delta \right \|_{p} \leq \epsilon <br />
\end{align}<br />
<br />
The Shadow Attack, on the other hand, targets the certificate of the defense by creating a new 'spoofed' certificate outside the certified region of the input image. It solves the following optimization problem, where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.<br />
<br />
\begin{align}<br />
\max_{\delta} L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> is the color regularizer, which ensures that only minimal changes are made to the color of the input image. <math>TV</math> is the Total Variation (smoothness) regularizer, which ensures that the newly created image stays smooth. <math>Dissim</math> is the dissimilarity regularizer, which ensures that all the color channels (RGB) are changed equally.<br />
<br />
The perturbations created in the original images are:<br />
<br />
'''1.''' small<br />
<br />
'''2.''' smooth<br />
<br />
'''3.''' without dramatic color changes<br />
<br />
There are two ways to keep this dissimilarity at zero or very low, and the authors have shown that both are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} = \delta_{G,i} = \delta_{B,i} \; \forall i </math>, i.e. for each pixel, the perturbations of all channels are equal. The perturbation is a single array <math> \delta_{ W \times H} </math> that is shared by all three channels of an image of size <math>3 \times W \times H</math>. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel need not be equal. It uses <math> \delta_{3 \times W \times H} </math> with <math>Dissim(\delta) = \| \delta_{R}- \delta_{B}\|_p + \| \delta_{G}- \delta_{B}\|_p +\| \delta_{R}- \delta_{G}\|_p </math> as the dissimilarity cost function.<br />
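<br />
The three penalty terms of equation (2) and the two attack variants can be illustrated with a short NumPy sketch. The <code>dissim</code> function follows the 3-channel formula above; the exact forms of <code>total_variation</code> and <code>color_reg</code> are assumptions made for illustration (an anisotropic total variation and a norm of per-channel mean absolute values), and all function names and <math>\lambda</math> values are hypothetical.<br />

```python
import numpy as np

def total_variation(delta):
    """Anisotropic total variation of a (3, H, W) perturbation (assumed form)."""
    dh = np.abs(np.diff(delta, axis=1)).sum()    # differences between vertical neighbors
    dw = np.abs(np.diff(delta, axis=2)).sum()    # differences between horizontal neighbors
    return dh + dw

def color_reg(delta, p=2):
    """p-norm of the per-channel mean absolute perturbation (assumed form)."""
    channel_means = np.abs(delta).mean(axis=(1, 2))
    return np.linalg.norm(channel_means, ord=p)

def dissim(delta, p=2):
    """Sum of pairwise p-norms between the R, G, and B perturbations."""
    r, g, b = (c.ravel() for c in delta)
    return (np.linalg.norm(r - b, ord=p)
            + np.linalg.norm(g - b, ord=p)
            + np.linalg.norm(r - g, ord=p))

def shadow_penalty(delta, lam_c=1.0, lam_tv=1.0, lam_s=1.0):
    """Penalty subtracted from the loss in equation (2); lambdas are placeholders."""
    return (lam_c * color_reg(delta)
            + lam_tv * total_variation(delta)
            + lam_s * dissim(delta))

rng = np.random.default_rng(0)
one_channel = rng.normal(scale=0.05, size=(1, 4, 4))
delta_1ch = np.repeat(one_channel, 3, axis=0)        # 1-channel attack: same array for R, G, B
delta_3ch = rng.normal(scale=0.05, size=(3, 4, 4))   # 3-channel attack: independent channels
```

Duplicating one array across the three channels makes every pairwise difference vanish, which is why the 1-channel attack has <math>Dissim(\delta)=0</math> without needing the <math>\lambda_s</math> term.<br />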
<br />
== Ablation Study of the Attack Parameters ==<br />
To determine the required number of SGD steps, the effect of <math> \lambda_s</math>, and the effect of <math> \lambda_{tv}</math> on each loss term in the cost function, the authors tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figures 4, 5, and 6, the classification loss <math>L(\delta)</math>, the Total Variation loss <math>TV(\delta)</math>, and the color regularizer <math>C(\delta)</math> converge to zero within 10 SGD steps. Note that since only the 1-channel attack was used in this part of the experiment, <math>Dissim(\delta)</math> was zero by construction. <br />
Figures 6 and 7 show the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss, respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors ran two experiments to show that their attack on a certified model was able to break those defenses. The datasets used for both experiments were CIFAR-10 and ImageNet.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.<br />
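<br />
To see why a wrong label can still carry a large certificate, it helps to look at how a smoothing certificate is computed. The sketch below uses the form commonly stated for Gaussian randomized smoothing in the <math>l_2</math> norm, <math>R = \frac{\sigma}{2}\left(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\right)</math>, where <math>p_A</math> and <math>p_B</math> are the probabilities of the top and runner-up classes under noise. This formula is not taken from the summary above, and the sketch is a simplification: it treats the probabilities as known, whereas in practice they are estimated from samples with confidence bounds. The function name and the example values are hypothetical.<br />

```python
from statistics import NormalDist

def certified_radius(p_top, p_runner_up, sigma):
    """l_2 radius sigma/2 * (Phi^{-1}(p_top) - Phi^{-1}(p_runner_up)).

    Both probabilities must lie strictly between 0 and 1.
    Returns 0.0 when the top class does not lead, i.e. no certificate.
    """
    if p_top <= p_runner_up:
        return 0.0
    inv = NormalDist().inv_cdf                # inverse CDF of the standard normal
    return 0.5 * sigma * (inv(p_top) - inv(p_runner_up))

# The radius depends only on how consistently the noisy copies vote,
# not on whether the winning label is the correct one.
r_weak = certified_radius(0.60, 0.40, sigma=0.5)
r_strong = certified_radius(0.99, 0.01, sigma=0.5)
```

Under this simplified view, an attack image whose noisy copies are all assigned the same wrong class receives a large radius, which matches the notion of a spoofed certificate described above.<br />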
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also natural images (larger radii means a stronger/more confident certificate) </div><br />
<br />
The third and fifth columns give the mean radius of the certified region of the original images and the mean radius of the spoofed certificates of the perturbed images, respectively. The mean certificate radius of the adversarial images was greater than that of the original images. This indicates that the 'Shadow Attack' approach was successful in creating spoofed certificates with a greater radius and the wrong label, and thus in breaking the certified defense.<br />
<br />
=== Attack on CROWN-IBP ===<br />
CROWN-IBP is an adversarial defense against <math>l_\infty</math>-norm bounded attacks. The same approach was applied to the CROWN-IBP defense, and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2:''' “Robust error” for natural images and “attack error” for Shadow Attack images using the<br />
CIFAR-10 dataset and CROWN-IBP models (smaller is better). </div><br />
<br />
<br />
The table above compares the robust error of the CROWN-IBP method on natural images with the attack error on Shadow Attack images. The attack errors were lower than the corresponding robust errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\infty</math>-norm certified defenses as well.<br />
<br />
== Conclusion ==<br />
From the experiments above, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The generated perturbations are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper is that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques==<br />
<br />
A notable point in this paper is that defenses and certifications formulated mathematically through an <math> l_{p} </math> constraint, as in equation \eqref{eq:op}, turn out to be weak. The top models cannot achieve certifications beyond an <math> \epsilon = 0.3 </math> disturbance in the <math> l_{2} </math> norm, while disturbances of <math> \epsilon = 4 </math> added to the target input are barely noticeable to human eyes, and images with disturbances of <math> \epsilon = 100 </math> are still easily classified by humans as belonging to the same class. As discussed by many authors, human perception of multi-dimensional space goes beyond what the <math> l_{p} </math> norm is capable of capturing and synthesizing. More comprehensive metrics and algorithms are yet to be proposed that capture the correlation between pixels of an image or input data and that better convey to optimization algorithms how humans distinguish the features of an input image. Such a metric would give optimization algorithms better intuition about the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
[1] Xu, H., Ma, Y., Liu, H. C., Deb, D., Liu, H., Tang, J. L., & Jain, A. K. (2020). Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. International Journal of Automation and Computing, 17(2), 151–178.<br />
<br />
[2] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
[3] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
[4] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.<br />
<br />
[5] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45407THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-20T12:05:47Z<p>A227jain: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data, e.g., molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes both of them (Battaglia et al., 2018). Despite the fact that GNNs have recently been proven very efficient in many applications, their theoretical properties are not yet well-understood.<br />
<br />
The ability of graph neural networks (GNNs) to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test incrementally constructs a labeling of the nodes of each graph and then decides whether two graphs are isomorphic by comparing the resulting labelings. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each graph node by combining it with the aggregation of the feature vectors of its neighbors. We call such GNNs aggregate-combine GNNs, or AC-GNNs. The WL labeling always refines the labeling produced by an AC-GNN; moreover, there are AC-GNNs that can reproduce the WL labeling. This characterization, however, does not settle the issue of which Boolean node classifiers (i.e., functions classifying nodes in graphs as true or false) can be expressed by GNNs: it does not imply that AC-GNNs can capture every node classifier that is refined by the WL test. The authors' work aims to answer the question of which node classifiers can be captured by GNN architectures such as AC-GNNs.<br />
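<br />
The incremental labeling performed by the WL test can be sketched in plain Python. This is a minimal illustration rather than code from the paper: the function names, the adjacency-list representation, and the example graphs are all hypothetical. Each round relabels a node from its own label together with the sorted labels of its neighbors, which mirrors the aggregate-and-combine update of an AC-GNN.<br />

```python
from collections import Counter

def wl_refine(adjacency, labels, n_rounds):
    """Relabel each node from (own label, sorted neighbor labels), n_rounds times."""
    labels = dict(labels)
    for _ in range(n_rounds):
        signatures = {
            v: (labels[v], tuple(sorted(labels[u] for u in adjacency[v])))
            for v in adjacency
        }
        # Compress each distinct signature to a small integer label.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        labels = {v: palette[signatures[v]] for v in adjacency}
    return labels

def wl_indistinguishable(adj_a, adj_b, n_rounds=3):
    """True if both graphs end up with the same multiset of WL labels."""
    # Refine the disjoint union so that labels are comparable across the two graphs.
    union = {("a", v): [("a", u) for u in nbrs] for v, nbrs in adj_a.items()}
    union.update({("b", v): [("b", u) for u in nbrs] for v, nbrs in adj_b.items()})
    labels = wl_refine(union, {v: 0 for v in union}, n_rounds)
    count_a = Counter(lab for (g, _), lab in labels.items() if g == "a")
    count_b = Counter(lab for (g, _), lab in labels.items() if g == "b")
    return count_a == count_b

path = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
```

On the path, the two end nodes receive one label and the middle node another after a single round. The hexagon and the pair of triangles are not isomorphic, yet every node in both has degree two, so the refinement never separates them; this is the kind of limitation that carries over to AC-GNNs.<br />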
<br />
[[File:adversarial_example.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Adversarial Example </div><br />
<br />
The impacts of adversarial attacks can be life-threatening in the real world. Consider the case of driverless cars where the model installed in a car is trying to read a STOP sign on the road. However, if the STOP sign is replaced by an adversarial image of the original image, and if that new image is able to fool the model to not make a decision to stop, it can lead to an accident. Hence it becomes really important to design the classifiers such that these classifiers are immune to such adversarial attacks.<br />
<br />
A deep network is trained on a set of augmented images along with the original images. For any given image, multiple augmented images are created and passed to the network so that the model learns from the augmented images as well. During the validation phase, after labeling an image, the defenses check whether there exists an image with a different label within a ball of a given radius around the input. Mathematically, such an adversarial example <math>x'</math> satisfies <math>distance(x,x')=\delta, f(x)\neq f(x')</math>, where <math>\delta</math> is some small number and <math>f(\cdot)</math> is the image label. If the classifier assigns the same class label to all images within the specified ball, then a certificate is issued. The certificate is intended to guarantee that the prediction cannot be changed by any perturbation inside the ball, and a defense that issues such certificates is called a certified defense. The image below shows a certified region (in red).<br />
<br />
[[File:certified_defense.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Certified Defense Illustration </div><br />
<br />
== Introduction ==<br />
Conventional deep learning models are generally highly sensitive to adversarial perturbations (Szegedy et al., 2013): natural-looking, minimally modified images can cause such models to misclassify. In the last few years, several defenses have been built to protect neural networks against such attacks (Madry et al., 2017; Shafahi et al., 2019), but these defenses are based on heuristics and tricks that are often easily breakable (Athalye et al., 2018). This has motivated many researchers to work on certifiably secure networks: classifiers that produce a label for an input image such that the classification remains constant within a bounded set of perturbations around that image. Certified defenses have thus far considered <math>l_\text{p}</math>-bounded attacks: after labelling an input, if there does not exist an image with a different label within the <math>l_\text{p}</math>-norm ball of radius <math>\epsilon</math> centred at the original input, then a certificate is issued. Most of the certified defenses created so far focus on deflecting <math>l_\text{p}</math>-bounded attacks where <math>p = 2</math> or <math>p = \infty</math>.<br />
<br />
In this paper, the authors demonstrate that a system that relies on certificates as a measure of label security can be exploited. The central idea of the paper is to show that even though a system has a certified defense mechanism, this does not guarantee security against adversarial attacks. This is done by presenting a new class of adversarial examples that target not only the classifier output label but also the certificate. The first step is to add adversarial perturbations that are large in the <math>l_\text{p}</math>-norm (larger than the radius of the certificate region of the original image), producing attack images that lie outside the certificate boundary of the original image and are surrounded by images with the same (wrong) label. The result is a 'spoofed' certificate with a seemingly strong security guarantee despite being adversarially manipulated.<br />
<br />
The following three conditions should be met while creating adversarial examples:<br />
<br />
'''1. Imperceptibility:''' the adversarial image looks like the original example.<br />
<br />
'''2. Misclassification:''' the certified classifier assigns an incorrect label to the adversarial example.<br />
<br />
'''3. Strongly certified:''' the certified classifier provides a strong radius certificate for the adversarial example.<br />
<br />
The main focus of the paper is to attack the certificate of the model. The authors argue that the model can be attacked, no matter how strong the certificate of the model is.<br />
<br />
== Approach ==<br />
The authors' approach, the 'Shadow Attack', is a generalization of the well-known Projected Gradient Descent (PGD) attack. PGD, a universal first-order adversary [4], is reported to be the only method that greatly improved neural network robustness among the defenses appearing at ICLR 2018 and CVPR 2018 [5]. The fundamental idea is the same in both attacks: a set of adversarial images is created to fool the network into making a wrong prediction. The PGD attack solves the following optimization problem, where <math>L</math> is the classification loss, <math>\theta</math> denotes the model parameters, <math>\delta</math> is the perturbation, and the constraint bounds the <math>l_p</math> norm of the perturbation by <math>\epsilon</math> so that only a small change is made to the input image. For a recent review of adversarial attacks and more information on PGD attacks, see [1].<br />
<br />
\begin{align}<br />
\max_{\delta} \; L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
\text{s.t.} \quad \left \|\delta \right \|_{p} \leq \epsilon <br />
\end{align}<br />
<br />
Shadow attack on the other hand targets the certificate of the defenses by creating a new 'spoofed' certificate outside the certificate region of the input image. Shadow attack solves the following optimization problem where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.<br />
<br />
\begin{align}<br />
\max_{\delta} \; L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> is the color regularizer, which ensures that only minimal changes are made to the color of the input image. <math>TV</math> is the Total Variation (smoothness) regularizer, which ensures that the perturbed image remains smooth. <math>Dissim</math> is the dissimilarity regularizer, which ensures that all the color channels (RGB) are changed roughly equally.<br />
<br />
The perturbations added to the original images are:<br />
<br />
'''1.''' small<br />
<br />
'''2.''' smooth<br />
<br />
'''3.''' without dramatic color changes<br />
<br />
There are two ways to keep this dissimilarity at zero or very low, and the authors have shown that both are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} = \delta_{G,i} = \delta_{B,i} \; \forall i </math>, i.e. for each pixel, the perturbations of all channels are equal. The perturbation is a single array <math> \delta_{ W \times H} </math> that is shared by all three channels of an image of size <math>3 \times W \times H</math>. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel need not be equal. It uses <math> \delta_{3 \times W \times H} </math> with <math>Dissim(\delta) = \| \delta_{R}- \delta_{B}\|_p + \| \delta_{G}- \delta_{B}\|_p +\| \delta_{R}- \delta_{G}\|_p </math> as the dissimilarity cost function.<br />
<br />
== Ablation Study of the Attack Parameters ==<br />
To determine the required number of SGD steps, the effect of <math> \lambda_s</math>, and the effect of <math> \lambda_{tv}</math> on each loss term in the cost function, the authors tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figures 4, 5, and 6, the classification loss <math>L(\delta)</math>, the Total Variation loss <math>TV(\delta)</math>, and the color regularizer <math>C(\delta)</math> converge to zero within 10 SGD steps. Note that since only the 1-channel attack was used in this part of the experiment, <math>Dissim(\delta)</math> was zero by construction. <br />
Figures 6 and 7 show the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss, respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors ran two experiments to show that their attack on a certified model was able to break those defenses. The datasets used for both experiments were CIFAR-10 and ImageNet.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.<br />
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also natural images (larger radii means a stronger/more confident certificate) </div><br />
<br />
The third and fifth columns give the mean radius of the certified region of the original images and the mean radius of the spoofed certificates of the perturbed images, respectively. The mean certificate radius of the adversarial images was greater than that of the original images. This indicates that the 'Shadow Attack' approach was successful in creating spoofed certificates with a greater radius and the wrong label, and thus in breaking the certified defense.<br />
<br />
=== Attack on CROWN-IBP ===<br />
CROWN-IBP is an adversarial defense against <math>l_\infty</math>-norm bounded attacks. The same approach was applied to the CROWN-IBP defense, and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2:''' “Robust error” for natural images and “attack error” for Shadow Attack images using the<br />
CIFAR-10 dataset and CROWN-IBP models (smaller is better). </div><br />
<br />
<br />
The table above compares the robust error of the CROWN-IBP method on natural images with the attack error on Shadow Attack images. The attack errors were lower than the corresponding robust errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\infty</math>-norm certified defenses as well.<br />
<br />
== Conclusion ==<br />
From the experiments above, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The generated perturbations are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper is that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques==<br />
<br />
A notable point in this paper is that defenses and certifications formulated mathematically through an <math> l_{p} </math> constraint, as in equation \eqref{eq:op}, turn out to be weak. The top models cannot achieve certifications beyond an <math> \epsilon = 0.3 </math> disturbance in the <math> l_{2} </math> norm, while disturbances of <math> \epsilon = 4 </math> added to the target input are barely noticeable to human eyes, and images with disturbances of <math> \epsilon = 100 </math> are still easily classified by humans as belonging to the same class. As discussed by many authors, human perception of multi-dimensional space goes beyond what the <math> l_{p} </math> norm is capable of capturing and synthesizing. More comprehensive metrics and algorithms are yet to be proposed that capture the correlation between pixels of an image or input data and that better convey to optimization algorithms how humans distinguish the features of an input image. Such a metric would give optimization algorithms better intuition about the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
[1] Xu, H., Ma, Y., Liu, H. C., Deb, D., Liu, H., Tang, J. L., & Jain, A. K. (2020). Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. International Journal of Automation and Computing, 17(2), 151–178.<br />
<br />
[2] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
[3] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
[4] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.<br />
<br />
[5] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS&diff=45361THE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-19T19:17:57Z<p>A227jain: Created page with " == Presented By == Abhinav Jain == Background == The ability of graph neural networks (GNNs) for distinguishing nodes in graphs has been recently characterized in terms of..."</p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
The ability of graph neural networks (GNNs) to distinguish nodes in graphs has recently been characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test incrementally constructs a labeling of the nodes of each graph and then decides whether two graphs are isomorphic by comparing the resulting labelings. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each graph node by combining it with the aggregation of the feature vectors of its neighbors. This characterization, however, does not settle the issue of which Boolean node classifiers (i.e., functions classifying nodes in graphs as true or false) can be expressed by GNNs.<br />
<br />
[[File:adversarial_example.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Adversarial Example </div><br />
<br />
The impacts of adversarial attacks can be life-threatening in the real world. Consider the case of driverless cars where the model installed in a car is trying to read a STOP sign on the road. However, if the STOP sign is replaced by an adversarial image of the original image, and if that new image is able to fool the model to not make a decision to stop, it can lead to an accident. Hence it becomes really important to design the classifiers such that these classifiers are immune to such adversarial attacks.<br />
<br />
A deep network is trained on a set of augmented images along with the original images. For any given image, multiple augmented images are created and passed to the network so that the model learns from the augmented images as well. During the validation phase, after labeling an image, the defenses check whether there exists an image with a different label within a ball of a given radius around the input. Mathematically, such an adversarial example <math>x'</math> satisfies <math>distance(x,x')=\delta, f(x)\neq f(x')</math>, where <math>\delta</math> is some small number and <math>f(\cdot)</math> is the image label. If the classifier assigns the same class label to all images within the specified ball, then a certificate is issued. The certificate is intended to guarantee that the prediction cannot be changed by any perturbation inside the ball, and a defense that issues such certificates is called a certified defense. The image below shows a certified region (in red).<br />
<br />
[[File:certified_defense.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Certified Defense Illustration </div><br />
<br />
== Introduction ==<br />
Conventional deep learning models are generally highly sensitive to adversarial perturbations (Szegedy et al., 2013): natural-looking, minimally modified images can cause such models to misclassify. In the last few years, several defenses have been built to protect neural networks against such attacks (Madry et al., 2017; Shafahi et al., 2019), but these defenses are based on heuristics and tricks that are often easily breakable (Athalye et al., 2018). This has motivated many researchers to work on certifiably secure networks: classifiers that produce a label for an input image such that the classification remains constant within a bounded set of perturbations around that image. Certified defenses have thus far considered <math>l_\text{p}</math>-bounded attacks: after labelling an input, if there does not exist an image with a different label within the <math>l_\text{p}</math>-norm ball of radius <math>\epsilon</math> centred at the original input, then a certificate is issued. Most of the certified defenses created so far focus on deflecting <math>l_\text{p}</math>-bounded attacks where <math>p = 2</math> or <math>p = \infty</math>.<br />
<br />
In this paper, the authors demonstrate that a system that relies on certificates as a measure of label security can be exploited. The central idea of the paper is to show that even though a system has a certified defense mechanism, this does not guarantee security against adversarial attacks. This is done by presenting a new class of adversarial examples that target not only the classifier output label but also the certificate. The first step is to add adversarial perturbations that are large in the <math>l_\text{p}</math>-norm (larger than the radius of the certificate region of the original image), producing attack images that lie outside the certificate boundary of the original image and are surrounded by images with the same (wrong) label. The result is a 'spoofed' certificate with a seemingly strong security guarantee despite being adversarially manipulated.<br />
<br />
The following three conditions should be met while creating adversarial examples:<br />
<br />
'''1. Imperceptibility:''' the adversarial image looks like the original example.<br />
<br />
'''2. Misclassification:''' the certified classifier assigns an incorrect label to the adversarial example.<br />
<br />
'''3. Strongly certified:''' the certified classifier provides a strong radius certificate for the adversarial example.<br />
<br />
The main focus of the paper is to attack the certificate of the model. The authors argue that the model can be attacked, no matter how strong the certificate of the model is.<br />
<br />
== Approach ==<br />
The authors' approach, the 'Shadow Attack', is a generalization of the well-known Projected Gradient Descent (PGD) attack. PGD, a universal first-order adversary [4], is reported to be the only method that greatly improved neural network robustness among the defenses appearing at ICLR 2018 and CVPR 2018 [5]. The fundamental idea is the same in both attacks: a set of adversarial images is created to fool the network into making a wrong prediction. The PGD attack solves the following optimization problem, where <math>L</math> is the classification loss, <math>\theta</math> denotes the model parameters, <math>\delta</math> is the perturbation, and the constraint bounds the <math>l_p</math> norm of the perturbation by <math>\epsilon</math> so that only a small change is made to the input image. For a recent review of adversarial attacks and more information on PGD attacks, see [1].<br />
<br />
\begin{align}<br />
\max_{\delta} \; L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
\text{s.t.} \quad \left \|\delta \right \|_{p} \leq \epsilon <br />
\end{align}<br />
<br />
Shadow attack on the other hand targets the certificate of the defenses by creating a new 'spoofed' certificate outside the certificate region of the input image. Shadow attack solves the following optimization problem where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.<br />
<br />
\begin{align}<br />
\max_{\delta} \; L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> is the color regularizer, which ensures that only minimal changes are made to the color of the input image. <math>TV</math> is the Total Variation (smoothness) regularizer, which ensures that the perturbed image remains smooth. <math>Dissim</math> is the dissimilarity regularizer, which ensures that all the color channels (RGB) are changed roughly equally.<br />
<br />
The perturbations added to the original images are:<br />
<br />
'''1.''' small<br />
<br />
'''2.''' smooth<br />
<br />
'''3.''' without dramatic color changes<br />
<br />
There are two ways to keep this dissimilarity at zero or very low, and the authors have shown that both are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} = \delta_{G,i} = \delta_{B,i} \; \forall i </math>, i.e. for each pixel, the perturbations of all channels are equal. The perturbation is a single array <math> \delta_{ W \times H} </math> that is shared by all three channels of an image of size <math>3 \times W \times H</math>. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel need not be equal. It uses <math> \delta_{3 \times W \times H} </math> with <math>Dissim(\delta) = \| \delta_{R}- \delta_{B}\|_p + \| \delta_{G}- \delta_{B}\|_p +\| \delta_{R}- \delta_{G}\|_p </math> as the dissimilarity cost function.<br />
<br />
== Ablation Study of the Attack parameters==<br />
To determine the required number of SGD steps, the effect of <math> \lambda_s</math>, and the importance of <math> \lambda_{tv}</math> for each loss term in the cost function, the authors tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figures 4, 5, and 6, we can see that <math>L(\delta)</math> (classification loss), <math>TV(\delta)</math> (Total Variation loss), and <math>C(\delta)</math> (color regularizer) converge to zero with 10 SGD steps. Note that since only the 1-channel attack was used in this part of the experiment, <math>Dissim(\delta)</math> was indeed zero. <br />
In figures 6 and 7, we can see the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss, respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors ran two experiments to show that their attack on a certified model was actually able to break those defenses. The datasets used for these experiments were CIFAR-10 and ImageNet.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>\ell_p</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. For the attack, perturbations that satisfy the previously defined conditions are made to the original image, and spoofed certificates are generated for an incorrect class by generating multiple adversarial images.<br />
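<br />
To make the notion of a certificate radius concrete, here is a minimal sketch assuming the Gaussian-noise formulation of randomized smoothing (Cohen et al., 2019), in which the certified <math>\ell_2</math> radius is <math>\sigma \Phi^{-1}(p)</math> for a lower bound <math>p</math> on the top-class probability. The function names and the toy classifier are hypothetical, and the statistical confidence-bound step used in practice is omitted.<br />

```python
import random
from statistics import NormalDist


def smoothed_vote(classifier, x, sigma, n_samples=1000, seed=0):
    """Classify Gaussian-noised copies of x; return (top_class, vote_fraction)."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        label = classifier(noisy)
        counts[label] = counts.get(label, 0) + 1
    top = max(counts, key=counts.get)
    return top, counts[top] / n_samples


def certified_radius(p_lower, sigma):
    """l2 radius sigma * Phi^{-1}(p_lower); None means the classifier abstains.

    p_lower should be a lower confidence bound on the top-class probability
    and must lie strictly between 0.5 and 1.
    """
    if not 0.5 < p_lower < 1.0:
        return None
    return sigma * NormalDist().inv_cdf(p_lower)


# Hypothetical toy classifier: class 1 if the features sum to a positive value.
toy_classifier = lambda x: int(sum(x) > 0)
top_class, fraction = smoothed_vote(toy_classifier, [5.0, 5.0], sigma=0.1)
print(top_class, fraction, certified_radius(0.9, sigma=0.5))
```

The certificate depends only on how consistently the noisy copies are classified, not on whether the label is correct. An attack image whose noisy copies are consistently assigned a wrong class therefore receives a large radius for that wrong class, which is what a spoofed certificate is.<br />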
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing:<br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also for natural images (a larger radius means a stronger/more confident certificate) </div><br />
<br />
The third and fifth columns give the mean radius of the certified region of the original images and the mean radius of the spoofed certificate of the perturbed images, respectively. The mean certificate radius of the adversarial images was greater than that of the original images. This indicates that the 'Shadow Attack' approach was successful in creating spoofed certificates of greater radius and with the wrong label, and hence in breaking the certified defense.<br />
<br />
=== Attack on CROWN-IBP ===<br />
CROWN-IBP is an adversarial defense against <math>\ell_\infty</math>-norm bounded attacks. The same approach was applied to the CROWN-IBP defense, and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2 :''' “Robust error” for natural images, and “attack error” for Shadow Attack images using the<br />
CIFAR-10 dataset and CROWN-IBP models. Smaller is better. </div><br />
<br />
<br />
The above table shows the robust errors of the CROWN-IBP method on natural images and the attack errors on Shadow Attack images. The errors in the case of the attack were lower than the corresponding errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>\ell_\infty</math>-norm certified defenses as well.<br />
<br />
== Conclusion ==<br />
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques==<br />
<br />
The paper makes it clear that defenses and certificates built on a purely mathematical formulation are weak when the constraint is imposed through an <math> l_{p} </math> norm, as assumed in equation \eqref{eq:op}. The top models cannot achieve certificates beyond an <math> \epsilon = 0.3 </math> disturbance in <math> l_{2} </math> norm, while disturbances of <math> \epsilon = 4 </math> added to the target input are barely noticeable to the human eye, and images disturbed with <math> \epsilon = 100 </math> are still easily classified by humans as belonging to the same class. As discussed by many authors, human perception of a multi-dimensional space goes beyond what the <math> l_{p} </math> norm is capable of capturing and synthesizing. More comprehensive metrics and algorithms have yet to be proposed that capture the correlation between pixels of an image or input data and better convey to optimization algorithms how humans distinguish features of an input image. Such a metric would give optimization algorithms a better handle on the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
[1] Xu, H., Ma, Y., Liu, H. C., Deb, D., Liu, H., Tang, J. L., & Jain, A. K. (2020). Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. International Journal of Automation and Computing, 17(2), 151–178.<br />
<br />
[2] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
[3] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
[4] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.<br />
<br />
[5] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=F21-STAT_940-Proposal&diff=42678F21-STAT 940-Proposal2020-10-09T16:59:13Z<p>A227jain: Abhinav, Gautam: Adding tentative project details</p>
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
<br />
<br />
<br />
Project # 1 Group members:<br />
<br />
McWhannel, Pierre<br />
<br />
Yan, Nicole<br />
<br />
Hussein Salamah, Ahmed <br />
<br />
Title: placeholder<br />
<br />
Description: placeholder<br />
<br />
<br />
<br />
Project # 2 Group members:<br />
<br />
Singh, Gursimran<br />
<br />
Sharma, Govind<br />
<br />
Chanana, Abhinav<br />
<br />
Title: Quick Text Description using Headline Generation and Text To Image Conversion<br />
<br />
Description: An automatic tool that generates a short description from long textual data is a useful mechanism for sharing information quickly. Most current approaches summarize the text using deep learning methods ranging from Transformers to different RNNs. For this project, instead of building a standard text summarizer, we aim to provide two separate utilities for generating a quick description of the text. First, we plan to develop a model that produces a headline for the long textual data, and second, we intend to generate an image describing the text. <br />
<br />
Headline Generation - Headline generation is a specific case of text summarization where the output is generally a few words that convey the gist of the text. In most cases, text summarization is an unsupervised learning problem. For headline generation, however, the original headlines are available in our training dataset, which makes it a supervised learning task. We plan to experiment with different Recurrent Neural Networks such as LSTMs and GRUs with varied architectures. For model evaluation, we are considering BERTScore, with which we can compare the reference headline with the headline generated by the model. We also aim to explore attention models for text (headline) generation. We will make use of the currently available techniques described in various research papers, but will also try to develop our own architecture if the previous methods do not give reliable results on our dataset. Therefore, this task would primarily fit under the category of application of deep learning to a particular domain, but could also include some components of new algorithm design.<br />
<br />
Text to Image Conversion - Generation or synthesis of images from a short text description is another very interesting application domain in deep learning. One approach to image generation is based on mapping image pixels to specific features as described by the discriminative feature representation of the text. Recurrent Neural Networks have been successfully used in learning such feature representations of text. This approach is difficult to generalize because recognizing discriminative features for texts in different domains is not an easy task and requires domain expertise. Different generative methods have been used, including Variational Recurrent Auto-Encoders and their extension, the Deep Recurrent Attention Writer (DRAW). We plan to experiment with Generative Adversarial Networks (GANs). GANs have been applied to domain-specific datasets, but we aim to apply different variants of GANs to the Microsoft COCO dataset, which has been used with other architectures. The analysis will focus on how well GANs generalize compared to the alternatives on the given dataset.<br />
<br />
Scope - The above models will be trained independently on different datasets. Therefore, for a particular text, only one of the two functionalities will be available.<br />
<br />
<br />
<br />
Project # 3 Group members:<br />
<br />
Sikri, Gaurav<br />
<br />
Bhatia, Jaskirat<br />
<br />
Title: Not decided yet (Placeholder)<br />
<br />
Description: Not decided yet :)<br />
<br />
<br />
Project # 4 Group members:<br />
<br />
Maleki, Danial<br />
<br />
Rasoolijaberi, Maral<br />
<br />
Title: Binary Deep Neural Network for the domain of Pathology<br />
<br />
Description: The binary neural network, which greatly reduces storage and computation, is a promising technique for deploying deep models on resource-limited devices. However, binarization inevitably causes severe information loss and, even worse, its discontinuity makes the deep network difficult to optimize. We want to investigate the possibility of using these types of networks in the domain of histopathology, where gigapixel images make such savings especially valuable.<br />
<br />
<br />
Project # 5 Group members:<br />
<br />
Jain, Abhinav<br />
<br />
Bathla, Gautam<br />
<br />
Title: lyft-motion-prediction-autonomous-vehicles(Kaggle)(Tentative)<br />
<br />
Description: Autonomous vehicles (AVs) are expected to dramatically redefine the future of transportation. However, there are still significant engineering challenges to be solved before one can fully realize the benefits of self-driving cars. One such challenge is building models that reliably predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians.<br />
<br />
Comments: We are more inclined towards a 3-D object detection project. We are in the process of finding the right problem statement for it and if we are not successful, we will continue with the above Kaggle competition.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=42663stat940F212020-10-09T11:11:55Z<p>A227jain: Adding my presentation details</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || || 1|| || || ||<br />
|-<br />
|Week of Nov 2 || || 2|| || || ||<br />
|-<br />
|Week of Nov 2 || || 3|| || || ||<br />
|-<br />
|Week of Nov 2 || || 4|| || || ||<br />
|-<br />
|Week of Nov 2 || || 5|| || || ||<br />
|-<br />
|Week of Nov 2 || || 6|| || || ||<br />
|-<br />
|Week of Nov 9 || || 7|| || || ||<br />
|-<br />
|Week of Nov 9 || || 8|| || || ||<br />
|-<br />
|Week of Nov 9 || || 9|| || || ||<br />
|-<br />
|Week of Nov 9 || || 10|| || || ||<br />
|-<br />
|Week of Nov 9 || || 11|| || || ||<br />
|-<br />
|Week of Nov 9 || || 12|| || || ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Generalization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A Baseline for Few-Shot Image Classification || [https://openreview.net/pdf?id=rylXBkrYDS Paper] || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-</div>A227jain