Going Deeper with Convolutions: Difference between revisions

Revision as of 12:58, 6 November 2018

Introduction

This paper presents a deep convolutional neural network architecture codenamed Inception. The architecture improves the utilization of computing resources inside the network by increasing its depth and width while keeping the computational budget constant. The design decisions were guided by the Hebbian principle (Footnote 1) and the intuition of multi-scale processing. The architecture was instantiated as a 22-layer deep network called GoogLeNet, which significantly outperformed the state of the art in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

Previous Work

The architecture builds on the network-in-network approach proposed by Lin et al. [1], who added 1 by 1 convolutional layers to a network. In Inception, these layers serve mainly as dimension reduction modules that significantly reduce the number of parameters of the model. The paper also took inspiration from Regions with Convolutional Neural Networks (R-CNN), proposed by Girshick et al. [2], which divides the overall detection problem into two subproblems: first using low-level cues to generate potential object proposals, and then using a CNN to classify the object category of each proposal.
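To make the dimension reduction concrete, here is a minimal pure-Python sketch of a 1 by 1 convolution followed by ReLU. The function name `conv1x1_relu`, the nested-list layout, and the toy weights are assumptions made for illustration; they are not taken from the paper.

```python
def conv1x1_relu(feature_map, weights, bias):
    """Apply a 1x1 convolution followed by ReLU.

    feature_map: nested list indexed [row][col][in_channel]
    weights:     nested list indexed [out_channel][in_channel]
    bias:        list indexed [out_channel]
    """
    output = []
    for row in feature_map:
        out_row = []
        for pixel in row:
            out_pixel = []
            # The same weights are applied to the channel vector of every pixel.
            for w, b in zip(weights, bias):
                z = sum(wi * xi for wi, xi in zip(w, pixel)) + b
                out_pixel.append(max(0.0, z))  # ReLU
            out_row.append(out_pixel)
        output.append(out_row)
    return output


# Toy input: a 2x2 feature map with 4 channels, reduced to 2 channels.
x = [[[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]],
     [[2.0, 0.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]]
w = [[0.5, 0.5, 0.0, 0.0],    # output channel 0: mean of input channels 0 and 1
     [0.0, 0.0, 1.0, -1.0]]   # output channel 1: channel 2 minus channel 3
b = [0.0, 0.0]

y = conv1x1_relu(x, w, b)
# Height, width, and depth of the result.
print(len(y), len(y[0]), len(y[0][0]))
```

The spatial size is unchanged while the depth drops from 4 to 2, which is the sense in which the layer acts as a dimension reduction module.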

Motivation

The performance of deep neural networks can be improved by increasing the depth and the width of the networks. However, this approach has two major drawbacks. First, the enlarged network tends to overfit the training data, especially if the number of labeled examples is limited. Second, learning a large number of parameters dramatically increases the computational resources required.

A fundamental way of handling both problems would be to replace fully connected architectures with sparsely connected ones. However, numerical calculation on non-uniform sparse data structures is inefficient on current hardware. The Inception architecture, motivated by Arora et al. [3] and Catalyurek et al. [4], overcomes this difficulty by clustering sparse matrices into relatively dense submatrices. It thereby takes advantage of both the extra sparsity and existing computational hardware.

Model Architecture:

The Inception architecture consists of stacked blocks called Inception modules. The idea is to increase the depth and width of the model by finding a locally optimal sparse structure and repeating it spatially. Traditionally, each layer of a convolutional network requires a choice between a pooling operation and a convolution, and of the convolution size (1 by 1, 3 by 3 or 5 by 5), even though all of these options are beneficial to the modeling power of the network. In an Inception module, instead of choosing one, all of these options are computed in parallel (Fig. 1a).

Inspired by the layer-by-layer construction of Arora et al. [3], the correlation statistics of the last layer are analyzed and its units are clustered into groups with high correlation. These clusters form the units of the next layer and are connected to the units of the previous layer. Each unit from the earlier layer corresponds to some region of the input image, and their outputs are concatenated into a filter bank. Additionally, because of the beneficial effect of pooling in convolutional networks, a parallel pooling path is added to each module.

The Inception module in its naïve form (Fig. 1a) is computationally expensive. Moreover, because the output concatenates the results of the various convolutions and the pooling layer, the output volume becomes extremely deep, so the claim that this architecture makes better use of memory and computation looks counterintuitive. This issue is addressed by adding a 1 by 1 convolution before the costly 3 by 3 and 5 by 5 convolutions (Fig. 1b). The idea of the 1 by 1 convolution was introduced by Lin et al. under the name network in network [1]. A 1 by 1 convolution is mathematically equivalent to applying the same small fully connected layer at every spatial position: it reduces the dimension of the filter space (the depth of the output volume), and the rectified linear activation (ReLU) that immediately follows each 1 by 1 convolution adds further non-linearity. The small kernel size (1 by 1) keeps the number of added parameters low, which helps to limit over-fitting. This dimensionality reduction shields the next stage from the large number of input filters of the previous stage (Footnote 2).
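The saving from placing a 1 by 1 convolution before a 5 by 5 convolution can be checked with simple arithmetic. The sketch below counts weights and multiply-accumulate operations (MACs) for one path of a module and sums the output depth of the parallel paths. The input size (28 by 28 with 192 channels) and the per-path channel counts are assumed values chosen for illustration, and `conv_cost` is a hypothetical helper; bias terms and the pooling cost are ignored.

```python
def conv_cost(height, width, in_channels, out_channels, kernel):
    """Cost of a stride-1, same-padded convolution (bias terms ignored).

    Returns (multiply_accumulates, weights).
    """
    weights = kernel * kernel * in_channels * out_channels
    macs = height * width * weights  # every weight is used once per output position
    return macs, weights


# Assumed sizes for illustration: a 28x28 input with 192 channels.
H, W, C_IN = 28, 28, 192

# Naive module (Fig. 1a): the 5x5 convolution sees all 192 input channels.
naive_macs, naive_weights = conv_cost(H, W, C_IN, 32, kernel=5)

# Module with reduction (Fig. 1b): a 1x1 convolution first shrinks 192 -> 16 channels.
squeeze_macs, squeeze_weights = conv_cost(H, W, C_IN, 16, kernel=1)
conv5_macs, conv5_weights = conv_cost(H, W, 16, 32, kernel=5)
reduced_macs = squeeze_macs + conv5_macs
reduced_weights = squeeze_weights + conv5_weights

# The module output concatenates the parallel paths along the channel axis,
# so its depth is the sum of the per-path output channels (assumed values).
branch_channels = {"1x1": 64, "3x3": 128, "5x5": 32, "pool_proj": 32}
output_depth = sum(branch_channels.values())

print(f"5x5 path, naive:   {naive_weights} weights, {naive_macs} MACs")
print(f"5x5 path, reduced: {reduced_weights} weights, {reduced_macs} MACs")
print(f"saving factor:     {naive_macs / reduced_macs:.1f}x")
print(f"output depth:      {output_depth}")
```

Under these assumptions the reduced path needs roughly a tenth of the weights and operations of the naive path, while the module still exposes 5 by 5 features to the next stage.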

Combining convolutions of several sizes resembles the way human vision interprets visual information: the visual system processes information at various scales and aggregates it so that features from different scales are extracted simultaneously. Similarly, in the Inception design the 1 by 1 (network-in-network) convolutions extract the fine-grained details of the input volume, while the medium- and large-sized filters cover a larger receptive field of the input and extract its features. The pooling operations reduce the spatial size, which helps to counter overfitting.

ILSVRC 2014 Challenge Results:

The proposed architecture was implemented through a deep network called GoogLeNet as a submission for ILSVRC14’s Classification Challenge and Detection Challenge.

The classification challenge is to classify each image into one of 1000 categories in the ImageNet hierarchy. Accuracy is measured by the top-5 error rate: the percentage of test examples for which the correct class is not among the 5 highest-ranked predicted classes. The final submission of GoogLeNet obtained a top-5 error of 6.67% on both the validation and the testing data, ranking first among all participants and significantly outperforming the top teams of previous years without using external data.
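The top-5 error rate defined above can be computed as in the following sketch. The function name `top5_error` and the toy scores (7 classes instead of 1000) are assumptions for illustration only.

```python
def top5_error(scores, labels):
    """Fraction of examples whose true label is not among the 5 highest scores.

    scores: list of per-example score lists, one score per class
    labels: list of true class indices
    """
    errors = 0
    for class_scores, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c], reverse=True)
        if label not in ranked[:5]:
            errors += 1
    return errors / len(labels)


# Toy example: the same 7-class score vector for four images.
# Its five highest-scoring classes are 1, 3, 4, 5 and 0.
s = [0.10, 0.30, 0.05, 0.20, 0.15, 0.12, 0.08]
scores = [s, s, s, s]
labels = [1, 0, 2, 6]  # classes 2 and 6 fall outside the top 5

error = top5_error(scores, labels)
print(f"top-5 error: {error:.2%}")
```

An example counts as correct even when the true class is only the fifth-ranked prediction, which is why the second image (label 0) is not an error here.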

The ILSVRC detection challenge asks for bounding boxes around objects in images, drawn from 200 possible classes. A detected object counts as correct if it matches the class of the ground truth and its bounding box overlaps the ground-truth box by at least 50%. Each image may contain multiple objects (at different scales) or none. Performance is reported as mean average precision (mAP). Using the Inception model as a region classifier, combined with Selective Search and an ensemble of 6 CNNs, GoogLeNet gave the top detection results, almost doubling the accuracy of the top model of 2013.
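The correctness criterion for a single detection can be sketched as follows, assuming that overlap is measured as intersection over union (the Jaccard index) and that boxes are given as (x1, y1, x2, y2) corner coordinates. The function names and example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred_class, pred_box, true_class, true_box, threshold=0.5):
    """A detection is correct if the class matches and the overlap reaches the threshold."""
    return pred_class == true_class and iou(pred_box, true_box) >= threshold


truth = ("dog", (0, 0, 10, 10))

# Same class, overlap exactly at the 50% threshold: counted as correct.
print(is_correct_detection("dog", (0, 0, 10, 5), *truth))
# Same class, but the boxes overlap only slightly: counted as incorrect.
print(is_correct_detection("dog", (5, 5, 15, 15), *truth))
# Good box, wrong class: counted as incorrect.
print(is_correct_detection("cat", (0, 0, 10, 10), *truth))
```

Mean average precision is then computed over many such decisions per class; that aggregation is not shown here.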

Conclusion:

GoogLeNet outperformed previous deep learning networks and served as a proof of concept that approximating the expected optimal sparse structure with readily available dense building blocks (the Inception modules) is a viable method for improving neural networks in computer vision. In detection, even without performing bounding box regression, the architecture achieved a significant gain in quality with a modest amount of computational resources.

Critiques

The paper's contributions towards patterning unordered network outputs and using associative embeddings for connecting vertices and edges are commendable. However, it should be noted that this paper is only an incremental improvement over existing, well-studied architectures such as the hourglass architecture. The modifications also seem ad hoc. The authors say that they make a slight modification to the hourglass design, double the number of features, and weight all the losses equally, but they give no scientific justification for why this is needed. The choice of the constants 3 and 6 for $s_o$ and $s_r$ is also unclear, as the authors leave out a fraction of the cases. I am not sure that the changes are truly a critical advance, because the experiments are conducted on only a single dataset and the authors make no generalizability arguments, so the methods might work well only for this dataset. The theoretical analysis in the paper comes directly from the hourglass literature and cannot be counted as novel.

Appendices

Appendix 1: Sample Outputs

Appendix 2: Stacked Hourglass Architecture

Although this goes beyond the focus of the paper, I would like to add a brief overview of the stacked hourglass architecture used to generate the heat map. This architecture is unique in that it allows cyclical top-down, bottom-up inference and recombination of features. While most architectures focus on optimizing the bottom-up portion (reducing dimensionality), the stacked hourglass architecture gives the network more flexibility in how it generates a representation by allowing it to learn a series of down-sampling and up-sampling steps.

References

1. Alejandro Newell and Jia Deng. Pixels to Graphs by Associative Embedding. NIPS, 2017.

2. Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked Hourglass Networks for Human Pose Estimation. ECCV, 2016.

3. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS, pages 91–99, 2015.

@@ Line 6: / Line 6: @@
 The current architecture is built on the network-in-network approach proposed by Lin et al. [1]. They added additional 1 X 1 convolutional layers, serving as dimension reduction modules to significantly reduce the number of parameters of the model. The paper also took inspiration from the Regions with Convolutional Neural Networks (R-CNN) proposed by Girshick et al. [2]. The overall detection problem is divided into two subproblems: to first utilize low-level cues for potential object proposals, and to then use CNN to classify object categories.
-== The Architecture: ==
+== Motivation ==
-: '''1. Detecting Graph Elements'''
-Given an image of dimensions h x w, a stacked hourglass (Appendix 2) is used to generate a h x w x f representation of the image. It should be noted that the dimension of the output (which is non-trainable),needs to fulfill certain criteria. Specifically, we need to have a resolution large enough to minimize the number of pixels with multiple detections while also being small enough to ensure that each 1 x 1 x f vector still contains the information needed for subsequent inference.
+The performance of deep neural networks can be improved by increasing the depth and the width of the networks. However, this suffers two major bottlenecks. One disadvantage is that the enlarged network tends to overfit the train data, especially if there is only limited labeled examples. The other drawback is the dramatic increase in computational resources when learning large number of parameters.
-A 1x1 convolution and sigmoid activation is performed on this result to generate a heat map (one for objects and one for relationships, using separately determined convolutions). The value at a given pixel can be interpreted as the likelihood of detection at that particular pixel in the original image.
+The fundamental way of handling both problems would be to use sparsely connected instead of fully connected networks and, at the same time, make numerical calculation on non-uniform sparse data structures efficient. Therefore, the inception architecture was motivated by Arora et al. [3] and Catalyurek et al. [4] and overcome these difficulties by clustering sparse matrices into relatively dense submatrices. It takes advantage of both extra sparsity and existing computational hardware.
-In order to claim that there is an element at some pixel, we need to have some likelihood threshold. Then, if a given pixel in the map has a value >= the threshold, we claim that there is an element at that pixel. This threshold is calculated by using binary cross-entropy loss on the final values in the heat map. Values with likelihoods greater than p-hat will be considered element detections.
-Finally, for each element that we detected, we extract the 1 x 1 x f feature vector. This is then used as an input to a set of Feed Forward Neural Networks (FFNNs), where we have a separate network for each characteristic of interest, and for each network, there's one hidden layer with f nodes. The object class and relationship (edges) could be supervised by softmax loss. Furthermore, in order to predict the bounding box of the object, we can use the approach proposed by the Faster-RCNN model[3]. The following image summarizes the process.
+== Model Architecture: ==
+The Inception architecture consists of stacking blocks called the inception modules. The idea is that to increase the depth and width of model by finding local optimal sparse structure and repeating it spatially. Traditionally, in each layer of convolutional network pooling operation and convolution and its size (1 by 1, 3 by 3 or 5 by 5) should be decided while all of them are beneficial for the modeling power of the network. Whereas, in Inception module instead of choosing, all these various options are computed simultaneously (Fig. 1a). Inspired by layer-by-layer construction of Arora et al. [3], in Inception module statistics correlation of the last layer is analyzed and clustered into groups of units with high correlation. These clusters form units of next layer and are connected to the units of previous layer. Each unit from the earlier layer corresponds to some region of the input image and the outputs of them are concatenated into a filter bank. Additionally, because of the beneficial effect of pooling in the convolutional networks, a parallel path of pooling has been added in each module. The Inception module in its naïve form (Fig. 1a) suffers from high computation and power cost. In addition, as the concatenated output from the various convolutions and the pooling layer will be an extremely deep channel of output volume, the claim that this architecture has an improved memory and computation power use looks like counterintuitive. However, this issue has been addressed by adding a 1 by 1 convolution before costly 3 by 3 and 5 by 5 convolutions. The idea of 1 by 1 convolution was first introduced by Lin et al. and called network in network [1]. This 1x1 convolution mathematically is equivalent to a multilayer perceptron which reduces the dimension of filter space (the depth of the output volume) and on top of that they also act as a non-linear rectifying activation layer ReLu to add to the non-linearity immediately after each 1 by 1 convolution (Fig. 1b). 
This enables less over-fitting due to smaller Kernel size (1 by 1). This distinctive dimensionality reduction feature of the 1 by 1 convolution allows shielding of the large number of input filters of the previous stage to the next stage (Footnote 2).
 [[File:Extraction Process.PNG|center]]
-:'''2. Connecting Elements with Associative Embeddings'''
+The combination of various layers of convolution has some similarity with human eyes in interpreting the visual information in a sense that human eyes also process the visual information at various scale and combines to extract the features from different scale simultaneously. Similarly, in inception design network in network designs extract the fine grain details of input volume while medium- and large-sized filters cover a large receptive field of the inputs and extract their features and with pooling operations overfitting can be overcome by reducing the spatial sizes.
-As explained earlier, to construct the scene graph, we need to know the source and destination of each edge. This is done through associative embeddings.
-First, let us define an embedding hi ϵ Rd produced for some vector i, and let us assume that we have n object detections in a particular image. Now, define hik, for k = 1 to Ki (where Ki is the number of edges in the graph with a vertex at vertex i) as the embedding associated with an edge that touches vertex i. We define two loss functions on these sets.
-<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Loss 1.PNG]]</div>
-The goal of Lpull is to minimize the squared differences between the embedding of a given vertex and the embedding of an edge that references said vertex.
-<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Loss 2.PNG]]</div>
-On the other hand, minimizing Lpush implies assigning embeddings to vertices that are as far apart as possible. The further apart they are, the lower the output of max becomes until eventually, it reaches 0. Here, m is just a constant. In the paper, the values used were m = 8 and d = 8 (that is, 8D embeddings). Combining these two loss functions (and weighing them equally), accomplishes the task of predicting embeddings such that vertices are differentiated, but the embedding of a vertex is most similar to the vertex it references.
-:'''3. Support for Overlapping Detections'''
-An obvious concern is how the network would operate if there was more than one detection (be it object or relationship), in a given pixel. For example, detection of “shirt” and “person” may be centered at the exact same pixel. To account for this, the architecture is modified to allow for “slots” at each pixel. Specifically, so detections of objects are allowed at a particular pixel, while sr relationship detections are allowed at a given pixel.
-In order to allow for this, some changes are required after the feature extraction step. Specifically, we now use the 1x1xf vector as the input for so (or sr) different sets of 4 FFNNs, where the output (of the first three) is as shown in figure 2, and with the final FFNN outputting the probability of a detection existing in that particular slot, at that particular pixel. This new network is trained exclusively on whether or not a detection has been made in that slot, and, in prediction, is used to determine the number of slots to output at a given pixel. It is critical to note that this each of these so (or sr) sets of FFNNs share absolutely no weights. And each is trained for detection in its assigned slot.
-It is important to note that this implies a change in the training procedure. We now have so (or sr) different predictions (be it class, or class + bounding box), that we need to match with our set of ground truth detections at a given pixel. Without this step, we would not be able to assign a value to the error for that sample. To do this, we match a one-hot encoded vector of the ground-truth class and bounding box anchor (the reference vector), and then match them with the so (or sr) outputs provided at a given pixel. The Hungarian method is used to ensure maximum matching between the outputs and the reference method while ensuring we do not assign the same detection to multiple slots.
-==Results==
-A quick note on notation: R@50 indicates what percentage of ground-truth subject-predicate-object tuples appeared in a proposal of 50 such tuples. Since R@100 offers more possibilities, it will necessarily be higher. The 6.7, for example, indicates that 6.7% of the ground truth tuples appeared in the proposals of the network.
-The authors tested the network against two other architectures designed to develop a semantic understanding of images. For this, they used the Visual Genome dataset, with so = 3 and sr = 6. Overall, the new architecture vastly outperformed past models. The results were as follows:
-The table can be interpreted as follows:
-[[File:Results Table.PNG|center|600px]]
-::'''SGGen (no RPN)''': Given a particular image, without the use of Region Proposal networks, the accuracy of the proposed scene graph. No class predictions are provided.
+== ILSVRC 2014 Challenge Results: ==
-::'''SGGen (with RPN)''': Same as above, except the output of the Region Proposal Network, is used to enhance the input of a given image. No class predictions are provided.
+The proposed architecture was implemented through a deep network called GoogLeNet as a submission for ILSVRC14’s Classification Challenge and Detection Challenge.
-::'''SGCIs''': Ground-truth object bounding boxes are provided. The network is asked to classify them and determine relationships.
-::'''PredCIs''': As above, except the classes are also provided. The only goal is to predict relationships.
-Further analysis into the accuracy, when looking at predicates individually, shows that the architecture is very sensitive to over-represented relationship predicates.
+The classification challenge is to classify images into one of 1000 categories in the Imagenet hierarchy. The top-5 error rate -  the percentage of test examples for which the correct class is not in the top 5 predicted classes - is used for measuring accuracy. The final submission of GoogLeNet obtains a top-5 error of 6.67% on both the validation and testing data, ranking first among all participants, significantly outperforming top teams in previous years, and not utilizing external data.
-<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Results - Part 2.PNG]]</div>
+The ILSVRC detection challenge asks to produce bounding boxes around objects in images among 200 classes. Detected objects count as correct if they match the class of the groundtruth and their bounding boxes overlap by at least 50%. Each image may contain multiple objects (with different scales) or none. The mean average precision (mAP) is used to report performance. Using the Inception model as a region classifier, combining Selective Search and using an ensemble of 6 CNNs, GoogLeNet gave top detection results, almost doubling accuracy of the the 2013 top model.
-As shown in Figure 5, for many ground-truth predicates (those that do not appear often in the ground truth), the network does poorly. Even when allowed to propose 100 tuples, the network does not offer the predicate. Figure 4 simply observes the fact that certain sets of relationship predicates appear predominantly in a subset of slots. No general explanation has been offered for this behavior.
-== Conclusion ==
+== Conclusion: ==
-In conclusion, the paper offers a novel approach that enables the extraction of image semantics while perpetually reasoning over the entire context of the image. Associative embeddings are used to connect object and predicate relationships, and parallel “slots” allow for multiple detections in one pixel. While this approach offers noticeable improvements in accuracy, it is clear that work needs to be done to account for the non-uniform distributions of relationships in the dataset.
+Googlenet outperformed the other previous deep learning networks, and it became a proof of concept that approximating the expected optimal sparse structure by readily available dense building blocks (or the inception modules) is a viable method for improving the neural networks in computer vision. Even without performing any bounding box operations to detect objects, this architecture gained a significant amount of quality with a modest amount of computational resources.