A Fair Comparison of Graph Neural Networks for Graph Classification


Presented By

Jaskirat Singh Bhatia

Background

Experimental reproducibility and replicability are critical topics in machine learning, and authors have frequently raised concerns about their absence in scientific publications. Recently, the graph representation learning field has attracted the attention of a wide research community, resulting in a large stream of works. In particular, several Graph Neural Network (GNN) models have been developed to tackle graph classification. However, the experimental procedures in this literature often lack rigour and are hardly reproducible. The authors re-evaluate such experiments to address the ambiguity of experimental procedures and the impossibility of reproducing reported results. They also standardize the experimental environment so that results can be reproduced within it.

Graph Neural Networks

1. A neural network that takes a graph as input.

2. Typical tasks include classifying the entire graph or predicting a missing edge/node in the graph (see the sketch below).
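
To make this concrete, here is a minimal sketch of a single message-passing layer followed by a global readout for graph classification, written in NumPy. The adjacency matrix, feature dimensions, and weight shapes are illustrative assumptions, not one of the architectures evaluated in the paper.

```python
import numpy as np

def gnn_graph_classification(A, X, W, W_out):
    """Toy message-passing layer + mean readout for graph classification.

    A     : (n, n) adjacency matrix of the input graph
    X     : (n, d) node feature matrix
    W     : (d, h) weights of the message-passing layer
    W_out : (h, c) weights of the graph-level classifier
    """
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)     # node degrees (including self-loop)
    H = np.maximum(0, (A_hat @ X) / deg @ W)   # mean-aggregate neighbours, then ReLU
    g = H.mean(axis=0)                         # global mean readout -> graph embedding
    return g @ W_out                           # class scores for the whole graph

# Tiny example: a 3-node path graph with 2-dimensional node features.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.random.randn(3, 2)
logits = gnn_graph_classification(A, X, np.random.randn(2, 8), np.random.randn(8, 2))
```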

Problems in Papers

Some of the most common reproducibility problems encountered in this field concern hyper-parameter selection and the correct usage of data splits for model selection versus model assessment. Moreover, the evaluation code is sometimes missing or incomplete, and experiments are not standardized across different works in terms of node and edge features.

These issues easily generate doubts and confusion among practitioners who need a fully transparent and reproducible experimental setting. The evaluation of a model goes through two different phases, namely model selection on the validation set and model assessment on the test set. Failing to keep these phases well separated can lead to over-optimistic and biased estimates of the true performance of a model, making it hard for other researchers to present competitive results without following the same ambiguous evaluation procedures.

Risk Assessment and Model Selection

Risk Assessment

The goal of risk assessment is to provide an estimate of the performance of a class of models. When a test set is not explicitly given, a common way to proceed is to use k-fold cross-validation. Since model selection is performed independently for each training/test split, different "best" hyper-parameter configurations are obtained; this is why one refers to the performance of a class of models.
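
A schematic sketch of this procedure is given below, assuming a hypothetical select_and_evaluate helper that performs model selection using only the training fold and then returns the score of the selected model on the held-out test fold; scikit-learn is used only to generate the stratified folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def k_fold_risk_assessment(X, y, select_and_evaluate, k=10, seed=42):
    """Estimate the performance of a class of models with k-fold CV.

    `select_and_evaluate(train_idx, test_idx)` is a hypothetical helper that
    runs model selection using only the training fold and then reports the
    accuracy of the selected model on the held-out test fold, so each fold
    may end up with a different "best" hyper-parameter configuration.
    """
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = [select_and_evaluate(tr, te) for tr, te in skf.split(X, y)]
    return np.mean(scores), np.std(scores)   # report mean and standard deviation
```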

Model Selection

The goal of model selection, or hyper-parameter tuning, is to choose among a set of candidate hyperparameter configurations the one that works best on a specific validation set.
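
A minimal sketch of this phase alone is shown below, assuming hypothetical train and evaluate helpers; the key point is that only the validation indices are used to compare configurations, and the test set is never touched here.

```python
def select_model(configs, train_idx, val_idx, train, evaluate):
    """Pick the hyper-parameter configuration that scores best on the validation set.

    `configs` is a list of candidate hyper-parameter dicts; `train` and
    `evaluate` are hypothetical helpers that fit a model on `train_idx`
    and score it on `val_idx`.
    """
    best_config, best_score = None, float("-inf")
    for config in configs:
        model = train(config, train_idx)
        score = evaluate(model, val_idx)
        if score > best_score:
            best_config, best_score = config, score
    return best_config
```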

Reproducibility Issues

The GNNs were selected based on the following criteria:

1. Performance obtained with 10-fold CV

2. Peer reviews

3. Strong architectural differences

4. Popularity

The criteria used to assess the quality of evaluation and reproducibility were as follows:

1. Code for data pre-processing

2. Code for model selection

3. Data splits are provided

4. Data is split by means of a stratification technique

5. Results of the 10-fold CV are reported correctly using standard deviations

Using these criteria, 4 different papers were selected, and their assessment of evaluation quality and reproducibility is as follows:


(Table to be inserted: assessment of evaluation quality and reproducibility for the selected papers)

Experiments

They re-evaluate the above-mentioned models on 9 datasets (4 chemical, 5 social), using a model selection and assessment framework that closely follows the rigorous practices described earlier. In addition, they implement two baselines whose purpose is to understand the extent to which GNNs are able to exploit structural information.
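
The summary above does not detail these baselines, but the general idea can be sketched as a structure-agnostic model: ignore the edges entirely, aggregate the node features, and feed the result to a small MLP. If a GNN cannot outperform such a model on a dataset, the structural information in that dataset is likely not being exploited. The weight shapes below are illustrative assumptions, not the exact baselines from the paper.

```python
import numpy as np

def structure_agnostic_baseline(X, W1, b1, W2, b2):
    """Illustrative structure-agnostic baseline for graph classification.

    X (n, d) holds the node features; the graph topology is never used.
    Node features are summed into a single graph vector and passed through
    a small MLP to produce class logits.
    """
    g = X.sum(axis=0)                  # aggregate node features, edges are ignored
    h = np.maximum(0, g @ W1 + b1)     # hidden layer with ReLU
    return h @ W2 + b2                 # class logits
```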