OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

From statwiki
Revision as of 01:58, 23 October 2015 by Amirlk

Introduction

The main point of this paper is to show that training a convolutional network to simultaneously classify, locate and detect objects in images can boost the accuracy of all three tasks. The paper proposes a new integrated approach to object detection, recognition, and localization with a single ConvNet. It also introduces a novel method for localization and detection that accumulates predicted bounding boxes. The authors suggest that, by combining many localization predictions, detection can be performed without training on background samples, which avoids the time-consuming and complicated bootstrapping training passes. Not training on background samples also lets the network focus solely on positive classes for higher accuracy.

Vision Tasks

Classification

Each image is assigned a single label corresponding to the main object in the image. Five guesses are allowed to find the correct answer, because an image can also contain multiple unlabeled objects. During the training phase, the model uses the same fixed input size approach proposed by Krizhevsky et al.<ref name=KrA> Krizhevsky, Alex, et al. [www.cs.toronto.edu/~fritz/absps/imagenet.pdf "ImageNet Classification with Deep Convolutional Neural Networks."] in NIPS (2012). </ref> The model maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution.
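
The objective and the five-guess rule described above can be illustrated with a minimal sketch. This is not the paper's implementation. The function names (`log_softmax`, `mean_log_prob`, `top_k_correct`) and the use of raw per-class scores as input are assumptions made for illustration only.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw class scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def mean_log_prob(batch_scores, labels):
    """Average, across training cases, of the log-probability of the correct label.

    Training maximizes this quantity (equivalently, minimizes its negative).
    """
    total = 0.0
    for scores, label in zip(batch_scores, labels):
        total += log_softmax(scores)[label]
    return total / len(labels)

def top_k_correct(scores, label, k=5):
    """Return True if the correct label is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return label in ranked[:k]
```

With two classes and equal scores, `mean_log_prob([[0.0, 0.0]], [0])` equals `math.log(0.5)`, since each class receives probability 0.5.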


Localization

As in classification, five guesses are allowed per image, and a bounding box is returned for the predicted object of each guess. To count as correct, the predicted box must match the groundtruth by at least 50% (using the PASCAL criterion of intersection over union) and be labeled with the correct class.
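
The overlap criterion can be sketched as follows. This is an illustrative example, not code from the paper. It assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples in continuous coordinates, and the helper names `iou` and `localization_correct` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def localization_correct(pred_label, pred_box, true_label, true_box, threshold=0.5):
    """A guess counts only if the class is right and the boxes overlap enough."""
    return pred_label == true_label and iou(pred_box, true_box) >= threshold
```

For example, two 2x2 boxes offset by one unit horizontally, `(0, 0, 2, 2)` and `(1, 0, 3, 2)`, share an area of 2 out of a union of 6. Their overlap is therefore 1/3, which fails the 50% criterion even if the class label is correct.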


Detection