Deep Residual Learning for Image Recognition: Difference between revisions
No edit summary |
No edit summary |
||
Line 5: | Line 5: | ||
= Introduction = | = Introduction = | ||
Image recognition algorithms have seen tremendous advancement over the last decade thanks to the development of more powerful GPUs and the increasingly available data sets, two components essential to training of deep architectures. One of the more recent progresses in the field is the development of the Deep Residual Learning network, an expansion on the traditional neural network concretized by a group of researchers led by K. He from Microsoft Research. This impressive breakthrough was awarded first place in the ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC 2015). | <p>Image recognition algorithms have seen tremendous advancement over the last decade thanks to the development of more powerful GPUs and the increasingly available data sets, two components essential to training of deep architectures. One of the more recent progresses in the field is the development of the Deep Residual Learning network, an expansion on the traditional neural network concretized by a group of researchers led by K. He from Microsoft Research. This impressive breakthrough was awarded first place in the ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC 2015).</p> | ||
Any neural network is built under two main principles: | <p>Any neural network is built under two main principles: | ||
<ol><li> | <ol><li>Modularity: increase the depth of a network by simply repeating a small module and aim to achieve higher accuracy</li> | ||
<li> | <li>Residual learning: add a new layer to the existing network only if can get something extra from that additional layer</li></ol> | ||
Based on these two principles, theoretically and given enough memory, neural networks may have as many layers as necessary. | |||
Yet, deeper networks possess two main problems: | Yet, deeper networks possess two main problems: | ||
If the network is too deep, training errors may be hard to propagate back correctly. For example, even in the simplest scenario where the activation function is the identity mapping, training errors may still increase along the backpropagation. | <ol><li>If the network is too deep, training errors may be hard to propagate back correctly. For example, even in the simplest scenario where the activation function is the identity mapping, training errors may still increase along the backpropagation.</li> | ||
If the layers are too narrow, the neural network may not learn enough representation power. This results in the degradation problem when the network starts to converge: accuracy gets saturated then degrades rapidly? | <li>If the layers are too narrow, the neural network may not learn enough representation power. This results in the degradation problem when the network starts to converge: accuracy gets saturated then degrades rapidly?</li></ol></p> | ||
To solve these two problems, K. He and his team developed the deep residual network algorithm, which allows deeper neural networks to be trained without sacrificing accuracy. This paper, “Deep Residual Learning for Image Recognition”, explains the concept of residual learning by first showcasing the structure of the network, followed by a comparison of experimental results to other models, and finally conclude with a discussion of some methodologies behind the network. In this report, we will attempt to explain this new methodology through our own words to ensure full understanding of the idea and exactly how it solved the issue of accuracy degradation. | To solve these two problems, K. He and his team developed the deep residual network algorithm, which allows deeper neural networks to be trained without sacrificing accuracy. This paper, “Deep Residual Learning for Image Recognition”, explains the concept of residual learning by first showcasing the structure of the network, followed by a comparison of experimental results to other models, and finally conclude with a discussion of some methodologies behind the network. In this report, we will attempt to explain this new methodology through our own words to ensure full understanding of the idea and exactly how it solved the issue of accuracy degradation. |
Revision as of 11:57, 14 November 2018
Presented by
Hanzhen Yang, Jing Pu Sun, Ganyuan Xuan, Yu Su, Jiacheng Weng, Keqi Li, Yi Qian, Bomeng Liu
Introduction
Image recognition algorithms have seen tremendous advancement over the last decade thanks to the development of more powerful GPUs and the increasingly available data sets, two components essential to training of deep architectures. One of the more recent progresses in the field is the development of the Deep Residual Learning network, an expansion on the traditional neural network concretized by a group of researchers led by K. He from Microsoft Research. This impressive breakthrough was awarded first place in the ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC 2015).
Any neural network is built under two main principles:
- Modularity: increase the depth of a network by simply repeating a small module and aim to achieve higher accuracy
- Residual learning: add a new layer to the existing network only if can get something extra from that additional layer
Based on these two principles, theoretically and given enough memory, neural networks may have as many layers as necessary. Yet, deeper networks possess two main problems:
- If the network is too deep, training errors may be hard to propagate back correctly. For example, even in the simplest scenario where the activation function is the identity mapping, training errors may still increase along the backpropagation.
- If the layers are too narrow, the neural network may not learn enough representation power. This results in the degradation problem when the network starts to converge: accuracy gets saturated then degrades rapidly?
To solve these two problems, K. He and his team developed the deep residual network algorithm, which allows deeper neural networks to be trained without sacrificing accuracy. This paper, “Deep Residual Learning for Image Recognition”, explains the concept of residual learning by first showcasing the structure of the network, followed by a comparison of experimental results to other models, and finally conclude with a discussion of some methodologies behind the network. In this report, we will attempt to explain this new methodology through our own words to ensure full understanding of the idea and exactly how it solved the issue of accuracy degradation.
Background
Modeling
Convolutional deep neural network is constructed by simply stacking many layers together. Such formulation of deep network exists two problems during training: 1) vanishing gradient problem; 2) degradation of accuracy [1]. In this section, detailed explanation of these two problems and corresponding solutions are discussed.
Vanishing / Exploding Gradient
Deep neural networks are trained using back propagation. The error gradient with respect to weight parameters at shallower layers can be expressed as a chain rule expansion of parameters at deeper layers. When a network has large number of layers, the gradient tends to vanish or explode during back propagation.
Consider a simple example of feedforward neural network with only one neuron at each layer as shown in figure [XX1].
Figure [XX1]: simple feedforward neural network with one neuron at hidden layers
The error gradient of weight [math]\displaystyle{ w_1 }[/math] can be expressed as:
[math]\displaystyle{ \frac{\partial Err}{\partial w_1} = \frac{\partial Err}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial x_3}\frac{\partial x_3}{\partial x_2}\frac{\partial x_2}{\partial w_1} \\ \ \ \ \ \ \ \ \ = \frac{\partial Err}{\partial \hat{y}} \cdot w_3 \cdot \sigma'(x_2 w_2) \cdot w_2 \cdot \sigma'(x_1 w_1) \cdot x_1 }[/math]
The activation function at each neuron is commonly selected to be Relu to avoid vanishing gradient when performing differentiation [REF]. When weight [math]\displaystyle{ w_3 }[/math] and [math]\displaystyle{ w_2 }[/math] are less than 1, the error gradient with respect to [math]\displaystyle{ w_1 }[/math] can be small due to multiplication of small numbers (a.k.a. vanishing gradient). When [math]\displaystyle{ w_3 }[/math] and [math]\displaystyle{ w_2 }[/math] are greater than 1, the error gradient with respect to [math]\displaystyle{ w_1 }[/math] can be large (a.k.a. exploding gradient). Both of these situations increase the difficulty of training.
degradation of accuracy
From empirical results, neural networks with stochastic gradient descent has difficulty optimizing the unnecessary weight to identity [1]. That is, adding more layers to a optimized model can increase the training error which contradicts to common understanding of neural networks. Figure [XX2] shows the comparison between training error of shallow and deep networks.
Figure [XX2]: Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks [1]
Assuming the deeper network is capable of training identity on the extra layers, it should produce training error no larger than the shallower network. This suggests that convolutional formulation of neural networks has difficulty of training identity.
Solution
To resolve vanishing / exploding gradient and degradation of accuracy, residual network (Resnet) is proposed by K. He, etc [1]. The mathematical explanation are discussed in this section.
FIgure [XX3] shows a representation of Resnet building block. The main change of the Resnet is to include a short-cut that passes information directly from shallower layer to deeper layer.
Figure [XX3]: Residual learning: a building block [1]
Table [XX1] illustrates the comparison of training identity in a Resnet and a convolutional neural network. Relu operation of x is denoted as [math]\displaystyle{ f(x) }[/math]. Assuming x1 is the output of a previous building block, thus [math]\displaystyle{ ReLU(x_1) = x_1 }[/math]. In order to train identity in a Resnet building block, weight matrix need to be trained to 0 instead of I which is assumed to be easier[1]. This is further proven in the Experiment / Results section.
Table [XX1]: Comparison between no short-cut and with short-cut formulation
Expression of [math]\displaystyle{ x_3 }[/math] | Condition for [math]\displaystyle{ x_1 = x_3 }[/math] | |
No short-cut | [math]\displaystyle{ x_2 = f(W_2 \cdot f(W_1x_1)) }[/math] | [math]\displaystyle{ W_1 = W_2 = I }[/math] |
With short-cut | [math]\displaystyle{ x_2 = f(W_2 \cdot f(W_1x_1)) + x_1 }[/math] | [math]\displaystyle{ W_1 = 0 }[/math] or [math]\displaystyle{ W_2 = 0 }[/math] |
The formulation of Resnet also addresses the vanishing / exploding gradient problem. According to figure [XX3], we denote [math]\displaystyle{ F(x_1, W_1^{'}) = W_2 f(W_1x_1) }[/math] where [math]\displaystyle{ W_1^{'} }[/math] is the equivalent weight parameters, then [math]\displaystyle{ x_2 = x_1 + F(x_1, W_1^{'}) }[/math]. To generalize the index, we obtain the following equations [3]:
[math]\displaystyle{ y_l = x_l + F(x_l, W_l^{'}) \\ x_{l+1} = f(y_l) }[/math]
Recursively, we can express x at any deeper layer L using information from shallower layer as following [3]:
By rearranging equation above, and differentiating with chain rule, the following expression is obtained [3]:
[math]\displaystyle{ \frac{\partial Err}{\partial x_l} = \frac{\partial Err}{\partial x_L} \frac{\partial x_L}{\partial x_l} = \frac{\partial Err}{\partial x_L} (1+ \frac{\partial }{\partial x_l} \sum_{i=1}^{L-1} F(x_i, W_i^{'})) }[/math]
This suggests that the gradient at shallower layer [math]\displaystyle{ l }[/math] sees information directly from deeper layer [math]\displaystyle{ L }[/math]. To make the error gradient at layer [math]\displaystyle{ l }[/math] vanish, the derivative term of [math]\displaystyle{ \sum F }[/math]must be equal to -1 which is very difficult to achieve in a batch learning process [3]. Since error gradient with respect to [math]\displaystyle{ x_l }[/math] is part of back propagation process, preventing vanishing gradient wrt. [math]\displaystyle{ x_l }[/math] can effectively prevent vanishing gradient of [math]\displaystyle{ W_{l-1} }[/math].
Intuition of Resnet
Figure [XX4] shows a interesting way of understanding Resnet. By expanding a simple Resnet with 3 building block, a graph with all possible path of information is constructed. Therefore, the Resnet can also be seen as a majority voting process.
Figure [XX4]: Understand Resnet as a majority voting process [4]
Experiment / Results
Conclusion
Reference
Appendix
VLAD
The area of image recognition and object retrieval has seen a steady trend of improvements in performance, where one of the most significant contribution is the introduction of the Vector of Locally Aggregated Descriptors (VLAD). It is designed to fit very large image datasets (e.g. 1 billion images) into main memory with its low dimension (e.g. 16 byte per image). https://www.robots.ox.ac.uk/~vgg/publications/2013/arandjelovic13/arandjelovic13.pdf
Fisher Vector
The Fisher Vector is an image representation obtained by sampling local image features, fitting Gaussian Mixture Model on those features, and resulting in vocabulary of dominant features in the image and their distributions. Form each Gaussian distribution, we measure the expectation of distance of image features using the likelihood a feature belongs to certain Gaussian. Concentrating result vector of each vocabulary into one large descriptor vector, normalize it, we will get the Fisher Vector. https://jacobgil.github.io/machinelearning/fisher-vectors-python
Multigrid Method
Multigrid Methods are algorithms for solving differential equations using a hierarchy of discretization. The main idea is to accelerate the convergence of basic iterative method with three significant strategies: smoothing, restriction and interpolation/prolongation. Which are used to reduce high frequency errors, down-sample the residual error to a coarser grid, and interpolate a correction computed on a coarser grid into a finer grid, respectively. https://www.wias-berlin.de/people/john/LEHRE/MULTIGRID/multigrid.pdf
VGG
VGG refers to a deep convolutional network for object recognition developed and trained by Oxford’s renowned Visual Geometry Group. VGGNet scored 1st place on image localization task and 2nd place on image detection task in the Image Net Large Scale Visual Recognition Challenge (ILSVRC) in 2014. VGG-16 network architecture and its python code listed below. https://gist.github.com/baraldilorenzo/07d7802847aaad0a35d3#file-vgg-16_keras-py
ReLU
The Rectified Linear Unit (ReLU) is one of the most commonly used activation function in deep learning models. It is a non-negative function which returns 0 if the input is less than 0 and return the input itself otherwise. It can be written as f(x)=max(0,x) . Graphically it looks like
https://www.kaggle.com/dansbecker/rectified-linear-units-relu-in-deep-learning
Convolutional Neural Network (CNN)
CNN is similar to ordinary neural networks that are made up of neurons with learnable weights and biases. Each of the neurons receives some inputs, performs a linear transformation and follows with a non-linear activation function. The difference is that CNN allow us to encode certain properties into the architecture since the explicit assumption is the inputs are images. A regular neuron receive a single vector as its input, which means a matrix input will be vectorized into a single column and achieve full connectivity. It will boost the number of parameters and lead to overfitting. CNN have neurons arranged in 3 dimensions: height, width and depth. The neurons in a layer will only be connected to a necessary portion of the layer before it, instead of wasteful fully connection. This will make CNN achieve higher accuracy and efficiency when dealing with image recognitions comparing to the ordinary neural network. A visualization is:
Left: A regular Neural Network. Right: A CNN arranges its neurons in three dimensions (width, height, depth). http://cs231n.github.io/convolutional-networks/
CIFAR-10
The CIFAR-10 is an established computer-vision dataset used for image recognition. It consists of 60000 32x32 color images which split into 10 completely mutually exclusive classes evenly. There are 50000 training images and 10000 test images. It is widely used in image recognition competitions and tasks.