Deep Residual Learning for Image Recognition Summary

Introduction

Neural networks, and convolutional neural networks in particular, have made significant progress in the field of classification. They consist of a number of layers, each of which contains a number of nodes. The deeper the network, that is, the more layers it has, the better it appears to be able to pick up on high-level features in the data. However, increasing the number of layers is difficult because it can cause problems with backpropagation, which is how the networks are trained. The 'vanishing gradient' problem, which has been addressed previously, refers to instances where the gradient used in backpropagation becomes too small to make discernible changes to the parameters as the model is tuned, so the model is unable to evolve.

The purpose of residual learning is to solve the issue of degradation, which is the unexpected phenomenon where deeper networks have higher training and testing error than their shallower counterparts. In theory, this should not occur, since a deeper network can always be constructed from a shallower one as follows: assuming the shallower network has m layers, the first m layers of the deeper network are a copy of the shallower network, and the remaining layers are identity layers whose output is the same as their input. A deeper network built this way produces the same outputs as the shallower one, so the error of the deeper network should be at most equal to that of the shallower network. However, this is not what is seen in practice.
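The construction argument can be checked on a toy example. The sketch below is only an illustration under stated assumptions: small fully connected ReLU layers stand in for real convolutional layers, and the helper names (`dense`, `forward`, `identity_layer`) are hypothetical. It appends identity layers to a random m-layer network and confirms that the output does not change.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def dense(weights, bias, v):
    # weights is a list of rows; computes W v + b
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def forward(layers, v):
    # Every layer is an affine map followed by ReLU
    for weights, bias in layers:
        v = relu(dense(weights, bias, v))
    return v

def identity_layer(width):
    # Identity weight matrix and zero bias: the layer passes its input through
    weights = [[1.0 if i == j else 0.0 for j in range(width)]
               for i in range(width)]
    return weights, [0.0] * width

random.seed(0)
width, m = 4, 3  # toy sizes, chosen only for illustration

# A random "shallower" network with m layers
shallow = [
    ([[random.uniform(-1, 1) for _ in range(width)] for _ in range(width)],
     [random.uniform(-1, 1) for _ in range(width)])
    for _ in range(m)
]

# The "deeper" network: a copy of the shallow one plus 5 identity layers
deeper = shallow + [identity_layer(width) for _ in range(5)]

x = [random.uniform(-1, 1) for _ in range(width)]
same_output = forward(deeper, x) == forward(shallow, x)
```

Here `same_output` is true because the activations entering the identity layers are already non-negative (they come out of a ReLU), so applying ReLU again leaves them unchanged. The degradation problem is that training does not find such a solution on its own, even though it exists.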

ImageNet Classification

The authors used the ImageNet classification dataset to measure accuracy on different models.

They first compare plain nets to ResNets with 18 and 34 layers. Each ResNet is identical to its plain counterpart except for an identity shortcut added around each two-layer stack (option A). Note that where the dimensions change, zero-padded shortcuts are used. The 34-layer plain net gives a higher training error than its 18-layer counterpart (the degradation problem; shown above in Table 2). Since batch normalization (BN) is used, which should prevent vanishing gradients, the authors instead conjecture that the deep plain nets have exponentially low convergence rates. The 34-layer ResNet has a lower training error than its 18-layer counterpart and also performs better than the 34-layer plain net, so the degradation problem is addressed.
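A minimal sketch of a two-layer residual block with a zero-padded identity shortcut (option A) is given below. It is an illustration under simplifying assumptions, not the authors' implementation: it operates on a single feature vector instead of an image, uses fully connected layers in place of convolutions, and omits BN and strides. The function names are hypothetical.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def dense(weights, bias, v):
    # weights is a list of rows; computes W v + b
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def zero_pad_shortcut(x, out_dim):
    # Option A: identity mapping, with extra zero entries when the
    # dimension increases. No parameters are introduced.
    return x + [0.0] * (out_dim - len(x))

def residual_block(x, w1, b1, w2, b2):
    # Residual function F(x): two stacked layers, ReLU after the first
    f = dense(w2, b2, relu(dense(w1, b1, x)))
    shortcut = zero_pad_shortcut(x, len(f))
    # Add the shortcut, then apply the second nonlinearity
    return relu([fi + si for fi, si in zip(f, shortcut)])

# With all weights at zero, F(x) = 0 and the block reduces to the shortcut
in_dim, out_dim = 2, 4
w1 = [[0.0] * in_dim for _ in range(out_dim)]
w2 = [[0.0] * out_dim for _ in range(out_dim)]
zeros = [0.0] * out_dim

y = residual_block([1.0, 2.0], w1, zeros, w2, zeros)
```

In this example `y` is `[1.0, 2.0, 0.0, 0.0]`: the input passes through unchanged and the padded entries stay at zero. This shows why an identity mapping is easy for a residual block to represent, since it only requires driving the weights of F towards zero.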

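The authors also consider projection shortcuts in two further configurations: (B) projection shortcuts are used only where dimensions change, with identity shortcuts elsewhere, and (C) all shortcuts are projections. Both improve accuracy slightly over option A: B because the zero-padded dimensions in A carry no residual learning, and C because of the extra parameters that the projections introduce. However, the differences are small compared with the added time complexity and model size, so projection shortcuts are not considered essential for addressing the degradation problem.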
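To get a rough feel for what the three shortcut options cost, the sketch below counts the weights that the shortcuts alone would add. The stage layout is an assumption based on the 34-layer architecture in the original paper (four stages of 3, 4, 6 and 3 two-layer blocks with 64, 128, 256 and 512 channels). Each projection is assumed to be a bias-free 1×1 convolution, and BN parameters are ignored. The function name `shortcut_params` is hypothetical.

```python
# (output channels, number of residual blocks) per stage.
# Assumed layout of the 34-layer network; not stated in this summary.
STAGES = [(64, 3), (128, 4), (256, 6), (512, 3)]

def shortcut_params(option, stages=STAGES, stem_channels=64):
    """Count weights used by shortcuts under option 'A', 'B' or 'C'."""
    total = 0
    in_ch = stem_channels
    for out_ch, blocks in stages:
        for _ in range(blocks):
            changes_dim = in_ch != out_ch
            # A: never a projection (zero-padding is parameter-free)
            # B: projection only where the dimension changes
            # C: projection on every shortcut
            if option == "C" or (option == "B" and changes_dim):
                total += in_ch * out_ch  # 1x1 convolution, no bias
            in_ch = out_ch
    return total

params_a = shortcut_params("A")
params_b = shortcut_params("B")
params_c = shortcut_params("C")
```

Under these assumptions option A adds no weights, option B adds 172,032, and option C adds 1,085,440, which illustrates why the small accuracy gain of C comes with a noticeably larger model.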

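To train deeper nets, they keep the identity shortcut and change each residual block to a bottleneck design. The bottleneck design uses a stack of three layers (1×1, 3×3, and 1×1 convolutions) instead of two in each residual function, while keeping the time complexity similar to that of the two-layer design. Identity shortcuts are particularly important here: replacing them with projections would roughly double the time complexity and model size. When applied to 50-layer, 101-layer, and 152-layer ResNets, accuracy improves with depth (Table 3). The 152-layer model won 1st place in ILSVRC 2015.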
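A rough weight count helps put the cost of the bottleneck design in perspective. The channel widths below are assumptions taken from the example blocks in the original paper (a two-layer block on 64 channels, and a bottleneck block on 256 channels with a 64-channel middle layer). Biases and BN parameters are ignored, and the helper name `conv_weights` is hypothetical.

```python
def conv_weights(kernel, in_ch, out_ch):
    # Number of weights in a kernel x kernel convolution without bias
    return kernel * kernel * in_ch * out_ch

# Two-layer block: 3x3 then 3x3, all on 64 channels
basic = conv_weights(3, 64, 64) + conv_weights(3, 64, 64)

# Bottleneck block: 1x1 reduces 256 -> 64, 3x3 works on 64,
# 1x1 restores 64 -> 256
bottleneck = (conv_weights(1, 256, 64)
              + conv_weights(3, 64, 64)
              + conv_weights(1, 64, 256))

# The same bottleneck with a 1x1 projection shortcut on 256 channels
bottleneck_with_projection = bottleneck + conv_weights(1, 256, 256)

ratio = bottleneck_with_projection / bottleneck
```

With these widths the two-layer block has 73,728 weights and the bottleneck block has 69,632, so the two designs cost about the same even though the bottleneck is one layer deeper and operates on four times as many input channels. Adding a projection shortcut raises the bottleneck to 135,168 weights, close to double. Because every layer here is applied at the same spatial resolution, the multiply-add counts scale in the same proportions.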

CIFAR-10 and Analysis

The CIFAR-10 dataset consists of 50k training images and 10k testing images in 10 classes. The experiments described here are trained on the training set and evaluated on the test set. The aim is to compare the behaviour of extremely deep networks using plain architectures and residual architectures. For deep plain nets, the training error increases as the network gets deeper. This phenomenon is similar to what was observed with the ImageNet dataset, suggesting that the optimization difficulty is a fundamental problem. However, as in the ImageNet case, the proposed architecture overcomes the optimization difficulty and achieves accuracy gains as the depth increases. The training and testing errors are shown in the following graphs: