stat441F18/YOLO


The You Only Look Once (YOLO) object detection model is a single-stage network that aims to combine competitive accuracy with extremely fast inference. Unlike other popular approaches to object detection, such as the sliding-window DPM [1] or region proposal models such as the R-CNN variants [2][3][4], YOLO does not consist of a multi-step pipeline that can become difficult to optimize. Instead, it frames object detection as a single regression problem and uses a single convolutional neural network to predict bounding boxes and class probabilities for each box.

Some of the cited benefits of the YOLO model are:

  1. Extreme speed. This is attributable to the framing of detection as a single regression problem.
  2. Learning of generalizable object representations. YOLO's average precision is experimentally found to degrade more slowly than that of comparable methods such as R-CNN when the model is tested on data whose pixel-level appearance differs significantly from the training data, such as the Picasso and People-Art datasets.
  3. Global reasoning during prediction. Since YOLO sees the entire image during training, unlike techniques that only see a subsection of the image, its neural network implicitly encodes global contextual information about classes in addition to their local properties. As a result, YOLO makes fewer misclassification errors on image backgrounds than methods that cannot benefit from this global context.

Presented by

  • Mitchell Snaith | msnaith@edu.uwaterloo.ca

Encoding predictions

The most important part of the YOLO model is its novel approach to prediction encoding. The input image is divided into an [math]\displaystyle{ S \times S }[/math] grid. Each grid cell predicts [math]\displaystyle{ B }[/math] bounding boxes, each consisting of 4 coordinates (the box center relative to the cell, and its width and height relative to the image) and a confidence score, together with [math]\displaystyle{ C }[/math] conditional class probabilities shared by the cell. The network output is therefore an [math]\displaystyle{ S \times S \times (B \cdot 5 + C) }[/math] tensor. For PASCAL VOC, [math]\displaystyle{ S = 7 }[/math], [math]\displaystyle{ B = 2 }[/math] and [math]\displaystyle{ C = 20 }[/math], giving the [math]\displaystyle{ 7 \times 7 \times 30 }[/math] output of the final layer in the architecture table below.
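
To make the encoding concrete, below is a minimal NumPy sketch (an illustration, not the authors' code) of how a 7 x 7 x 30 output tensor could be decoded into detections. The per-cell layout of B boxes of (x, y, w, h, confidence) followed by C class probabilities follows the paper; the function name and confidence threshold are illustrative, and non-maximal suppression is omitted.

import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)

def decode_predictions(pred, conf_threshold=0.2):
    """Decode a YOLO output tensor of shape (S, S, B*5 + C) into boxes."""
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]  # Pr(class | object), shared by the cell
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                # Class-specific confidence = Pr(class | object) * Pr(object) * IOU
                scores = class_probs * conf
                cls = int(np.argmax(scores))
                if scores[cls] < conf_threshold:
                    continue
                # (x, y) are offsets within the cell; convert the box center to
                # image-relative coordinates. (w, h) are already image-relative.
                cx = (col + x) / S
                cy = (row + y) / S
                detections.append((cx, cy, w, h, float(scores[cls]), cls))
    return detections

# Example: decode a random tensor shaped like the final 7 x 7 x 30 layer output
dets = decode_predictions(np.random.rand(S, S, B * 5 + C))

In practice, the surviving boxes are then filtered with non-maximal suppression to remove duplicate detections of the same object.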

References

  • [1] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
  • [2] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE, 2014.
  • [3] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
  • [4] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015.


Overview of network architecture

The network takes a 448 x 448 x 3 input image and passes it through 24 convolutional layers followed by 2 fully connected layers:

Layer               Filters        Stride   Output dimension
Conv 1              7 x 7 x 64     2        224 x 224 x 64
Max Pool 1          2 x 2          2        112 x 112 x 64
Conv 2              3 x 3 x 192    1        112 x 112 x 192
Max Pool 2          2 x 2          2        56 x 56 x 192
Conv 3              1 x 1 x 128    1        56 x 56 x 128
Conv 4              3 x 3 x 256    1        56 x 56 x 256
Conv 5              1 x 1 x 256    1        56 x 56 x 256
Conv 6              3 x 3 x 512    1        56 x 56 x 512
Max Pool 3          2 x 2          2        28 x 28 x 512
Conv 7              1 x 1 x 256    1        28 x 28 x 256
Conv 8              3 x 3 x 512    1        28 x 28 x 512
Conv 9              1 x 1 x 256    1        28 x 28 x 256
Conv 10             3 x 3 x 512    1        28 x 28 x 512
Conv 11             1 x 1 x 256    1        28 x 28 x 256
Conv 12             3 x 3 x 512    1        28 x 28 x 512
Conv 13             1 x 1 x 256    1        28 x 28 x 256
Conv 14             3 x 3 x 512    1        28 x 28 x 512
Conv 15             1 x 1 x 512    1        28 x 28 x 512
Conv 16             3 x 3 x 1024   1        28 x 28 x 1024
Max Pool 4          2 x 2          2        14 x 14 x 1024
Conv 17             1 x 1 x 512    1        14 x 14 x 512
Conv 18             3 x 3 x 1024   1        14 x 14 x 1024
Conv 19             1 x 1 x 512    1        14 x 14 x 512
Conv 20             3 x 3 x 1024   1        14 x 14 x 1024
Conv 21             3 x 3 x 1024   1        14 x 14 x 1024
Conv 22             3 x 3 x 1024   2        7 x 7 x 1024
Conv 23             3 x 3 x 1024   1        7 x 7 x 1024
Conv 24             3 x 3 x 1024   1        7 x 7 x 1024
Fully Connected 1   -              -        4096
Fully Connected 2   -              -        7 x 7 x 30
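
As a rough illustration, the architecture above could be expressed as a PyTorch module along the following lines. This is a minimal sketch rather than the authors' implementation (the original model was written in Darknet); the configuration list mirrors the table, the LeakyReLU slope of 0.1 and the dropout after the first fully connected layer follow the paper, and all names are illustrative.

import torch
import torch.nn as nn

# (kernel size, output channels, stride) per conv layer; 'M' = 2 x 2 max pool, stride 2
CFG = [
    (7, 64, 2), 'M',
    (3, 192, 1), 'M',
    (1, 128, 1), (3, 256, 1), (1, 256, 1), (3, 512, 1), 'M',
    *([(1, 256, 1), (3, 512, 1)] * 4), (1, 512, 1), (3, 1024, 1), 'M',
    *([(1, 512, 1), (3, 1024, 1)] * 2), (3, 1024, 1), (3, 1024, 2),
    (3, 1024, 1), (3, 1024, 1),
]

class YOLOv1(nn.Module):
    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        self.S, self.B, self.C = S, B, C
        layers, in_ch = [], 3
        for item in CFG:
            if item == 'M':
                layers.append(nn.MaxPool2d(2, 2))
            else:
                k, out_ch, stride = item
                layers.append(nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2))
                layers.append(nn.LeakyReLU(0.1))
                in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 1024, 4096),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.5),
            nn.Linear(4096, S * S * (B * 5 + C)),
        )

    def forward(self, x):
        out = self.head(self.features(x))
        # Reshape the flat output into the S x S x (B*5 + C) prediction grid
        return out.view(-1, self.S, self.S, self.B * 5 + self.C)

# A 448 x 448 RGB input produces the 7 x 7 x 30 prediction tensor
y = YOLOv1()(torch.randn(1, 3, 448, 448))  # y.shape == (1, 7, 7, 30)

Note how the final reshaped output matches the prediction encoding described above: one vector of B * 5 + C = 30 values per grid cell.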