stat441F18/YOLO

The '''You Only Look Once''' ('''YOLO''') object detection model is a one-shot object detection network that aims to combine suitable accuracy with extreme speed. Unlike other popular approaches to object detection, such as sliding-window DPM <sup>[[#References|[1]]]</sup> or region proposal models such as the R-CNN variants <sup>[[#References|[2]]]</sup><sup>[[#References|[3]]]</sup><sup>[[#References|[4]]]</sup>, YOLO does not consist of a multi-step pipeline that can become difficult to optimize. Instead, it frames object detection as a single regression problem and uses a single convolutional neural network to predict bounding boxes and the class probabilities for those boxes directly from full images.
Some of the cited benefits of the YOLO model are:
#''Extreme speed''. This is attributable to the framing of detection as a single regression problem.
#''Learning of generalizable object representations''. YOLO is experimentally found to degrade in average precision at a slower rate than comparable methods such as R-CNN when tested on data whose pixel-level statistics differ substantially from natural images, such as the Picasso and People-Art datasets.
#''Global reasoning during prediction''. Since YOLO sees whole images during training, unlike other techniques which only see a subsection of the image, its neural network implicitly encodes global contextual information about classes in addition to their local properties. As a result, YOLO makes fewer misclassification errors on image backgrounds than methods which cannot benefit from this global context.
=Presented by=
*Mitchell Snaith | msnaith@edu.uwaterloo.ca
 
=Encoding predictions=
 
The most important part of the YOLO model is its novel approach to prediction encoding. The input image is divided into an <math>S \times S</math> grid of cells, and the grid cell containing an object's centre is responsible for detecting that object. Each cell predicts <math>B</math> bounding boxes, each described by four coordinates and a confidence score, together with <math>C</math> conditional class probabilities, so the network output is a single <math>S \times S \times (B \cdot 5 + C)</math> tensor. With the PASCAL VOC settings <math>S = 7</math>, <math>B = 2</math>, and <math>C = 20</math>, this gives the <math>7 \times 7 \times 30</math> output produced by the final layer of the architecture below.
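
A minimal sketch of how such a prediction tensor can be decoded into detections is shown below. It is not the authors' code: the function name <code>decode</code> and the confidence threshold of 0.2 are illustrative assumptions, and the non-maximum suppression step used in the paper is omitted; only the <math>7 \times 7 \times 30</math> layout follows the paper's PASCAL VOC configuration.

<syntaxhighlight lang="python">
import numpy as np

# PASCAL VOC configuration from the paper: S = 7 grid cells per side,
# B = 2 boxes per cell, C = 20 classes, so each cell holds
# B * 5 + C = 30 numbers and the full output is 7 x 7 x 30.
S, B, C = 7, 2, 20

def decode(output, conf_threshold=0.2):
    """Turn a raw S x S x (B*5 + C) output into (x, y, w, h, class, score) tuples."""
    assert output.shape == (S, S, B * 5 + C)
    detections = []
    for row in range(S):
        for col in range(S):
            cell = output[row, col]
            class_probs = cell[B * 5:]  # Pr(class | object), shared by the cell's boxes
            for b in range(B):
                # Each box stores (x, y, w, h, confidence); (x, y) are offsets
                # within the cell, (w, h) are relative to the whole image.
                x, y, w, h, conf = cell[b * 5 : b * 5 + 5]
                # Class-specific score = Pr(class | object) * Pr(object) * IOU.
                scores = class_probs * conf
                best = int(np.argmax(scores))
                if scores[best] < conf_threshold:
                    continue
                # Convert the cell-relative centre to image-relative coordinates.
                cx, cy = (col + x) / S, (row + y) / S
                detections.append((cx, cy, w, h, best, float(scores[best])))
    return detections

# Exercise the shapes with a random "prediction".
print(len(decode(np.random.rand(S, S, B * 5 + C))))
</syntaxhighlight>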
 
=References=
* <sup>[https://ieeexplore.ieee.org/document/5255236 [1]]</sup> P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', 32(9):1627–1645, 2010.

* <sup>[https://arxiv.org/abs/1311.2524 [2]]</sup> R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In ''Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on'', pages 580–587. IEEE, 2014.

* <sup>[https://arxiv.org/abs/1504.08083 [3]]</sup> R. B. Girshick. Fast R-CNN. ''CoRR'', abs/1504.08083, 2015.

* <sup>[https://arxiv.org/abs/1506.01497 [4]]</sup> S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. ''arXiv preprint arXiv:1506.01497'', 2015.


=Overview of network architecture=

{| class="wikitable"
! Layer !! Filters !! Stride !! Output dimension
|-
| Conv 1 || 7 × 7 × 64 || 2 || 224 × 224 × 64
|-
| Max Pool 1 || 2 × 2 || 2 || 112 × 112 × 64
|-
| Conv 2 || 3 × 3 × 192 || 1 || 112 × 112 × 192
|-
| Max Pool 2 || 2 × 2 || 2 || 56 × 56 × 192
|-
| Conv 3 || 1 × 1 × 128 || 1 || 56 × 56 × 128
|-
| Conv 4 || 3 × 3 × 256 || 1 || 56 × 56 × 256
|-
| Conv 5 || 1 × 1 × 256 || 1 || 56 × 56 × 256
|-
| Conv 6 || 3 × 3 × 512 || 1 || 56 × 56 × 512
|-
| Max Pool 3 || 2 × 2 || 2 || 28 × 28 × 512
|-
| Conv 7 || 1 × 1 × 256 || 1 || 28 × 28 × 256
|-
| Conv 8 || 3 × 3 × 512 || 1 || 28 × 28 × 512
|-
| Conv 9 || 1 × 1 × 256 || 1 || 28 × 28 × 256
|-
| Conv 10 || 3 × 3 × 512 || 1 || 28 × 28 × 512
|-
| Conv 11 || 1 × 1 × 256 || 1 || 28 × 28 × 256
|-
| Conv 12 || 3 × 3 × 512 || 1 || 28 × 28 × 512
|-
| Conv 13 || 1 × 1 × 256 || 1 || 28 × 28 × 256
|-
| Conv 14 || 3 × 3 × 512 || 1 || 28 × 28 × 512
|-
| Conv 15 || 1 × 1 × 512 || 1 || 28 × 28 × 512
|-
| Conv 16 || 3 × 3 × 1024 || 1 || 28 × 28 × 1024
|-
| Max Pool 4 || 2 × 2 || 2 || 14 × 14 × 1024
|-
| Conv 17 || 1 × 1 × 512 || 1 || 14 × 14 × 512
|-
| Conv 18 || 3 × 3 × 1024 || 1 || 14 × 14 × 1024
|-
| Conv 19 || 1 × 1 × 512 || 1 || 14 × 14 × 512
|-
| Conv 20 || 3 × 3 × 1024 || 1 || 14 × 14 × 1024
|-
| Conv 21 || 3 × 3 × 1024 || 1 || 14 × 14 × 1024
|-
| Conv 22 || 3 × 3 × 1024 || 2 || 7 × 7 × 1024
|-
| Conv 23 || 3 × 3 × 1024 || 1 || 7 × 7 × 1024
|-
| Conv 24 || 3 × 3 × 1024 || 1 || 7 × 7 × 1024
|-
| Fully Connected 1 || – || – || 4096
|-
| Fully Connected 2 || – || – || 7 × 7 × 30
|}
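
The table corresponds to the network in the paper: 24 convolutional layers followed by two fully connected layers, operating on a <math>448 \times 448 \times 3</math> input image. As a quick sanity check of the listed output dimensions, the short sketch below (not the authors' code) walks the spatial size and channel count through the layers, assuming 'same' padding for every convolution so that only the stride changes the spatial size:

<syntaxhighlight lang="python">
# Each entry is (kind, kernel, output channels, stride), read off the table above.
layers = (
    [("conv", 7, 64, 2), ("pool", 2, None, 2),
     ("conv", 3, 192, 1), ("pool", 2, None, 2),
     ("conv", 1, 128, 1), ("conv", 3, 256, 1),
     ("conv", 1, 256, 1), ("conv", 3, 512, 1), ("pool", 2, None, 2)]
    + [("conv", 1, 256, 1), ("conv", 3, 512, 1)] * 4
    + [("conv", 1, 512, 1), ("conv", 3, 1024, 1), ("pool", 2, None, 2)]
    + [("conv", 1, 512, 1), ("conv", 3, 1024, 1)] * 2
    + [("conv", 3, 1024, 1), ("conv", 3, 1024, 2),
       ("conv", 3, 1024, 1), ("conv", 3, 1024, 1)]
)

size, channels = 448, 3  # 448 x 448 x 3 input, as in the paper
for kind, kernel, depth, stride in layers:
    size //= stride  # with 'same' padding, only the stride shrinks the feature map
    if kind == "conv":
        channels = depth
    print(f"{kind} {kernel}x{kernel} stride {stride} -> {size} x {size} x {channels}")
# Ends at 7 x 7 x 1024, which the fully connected layers map to 4096
# and then to 7 * 7 * 30 = 1470 outputs.
</syntaxhighlight>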