overfeat: integrated recognition, localization and detection using convolutional networks
The main point of this paper is to show that training a convolutional network to simultaneously classify, locate and detect objects in images can boost the classification accuracy and the detection and localization accuracy of all tasks. The paper proposes a new integrated approach to object detection, recognition, and localization with a single ConvNet. We also introduce a novel method for localization and detection by accumulating predicted bounding boxes. We suggest that by combining many localization predictions, detection can be performed without training on background samples and that it is possible to avoid the time-consuming and complicated bootstrapping training passes. Not training on background also lets the network focus solely on positive classes for higher accuracy.
Each image is assigned a single label corresponding to the main object in the image. Five guesses are allowed to find the correct answer (because images can also contain multiple unlabeled objects). During the train phase, this model uses the same fixed input size approach proposed by Krizhevsky et al. <ref name=KrA> Krizhevsky, Alex, et al "ImageNet Classiﬁcation with Deep Convolutional Neural Networks." in NIPS (2012). </ref>. This model maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution.As depicted in Figure 1, this network contains eight layers with weights; the ﬁrst ﬁve are convolutional and the remaining three are fully-connected. The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels. This network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution.
Each image is downsampled so that the smallest dimension is 256 pixels. Then five random crops (and their horizontal flips) of size 221x221 pixels are extracted and presented to the network in mini-batches of size 128. The weights in the network are initialized randomly. They are then updated by stochastic gradient descent. Overﬁtting can be reduced by using “DropOut” <ref name=HiG> Hinton, Geoffrey, et al "Improving neural networks by preventing co-adaptation of feature detectors." arXiv:1207.0580, (2012). </ref> to prevent complex co-adaptations on the training data. On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present. DropOut is employed on the fully connected layers (6th and 7th) in the classifier. For training phase, multiple GPUs are used to increase the computation speed.
For test phase, the entire image is explored by densely running the network at each location and at multiple scales. This approach yields significantly more views for voting, which increases robustness while remaining efficient.
After classifying five objects in the image, a bounding box for each classified object is returned. The predicted box must match the groundtruth by at least 50% (using the PASCAL criterion of union over intersection), as well as be labeled with the correct class.
For localization, the classification-trained network is modified. to do so, classifier layers are replaced by a regression network and then trained to predict object bounding boxes at each spatial location and scale. Then regression predictions are combined together, along with the classification results at each location.