The '''You Only Look Once''' ('''YOLO''') object detection model is a one-shot object detection network that aims to combine competitive accuracy with extreme speed. Unlike other popular approaches to object detection, such as sliding-window DPM <sup>[[#References|[1]]]</sup> or region proposal models such as the R-CNN variants <sup>[[#References|[2]]]</sup><sup>[[#References|[3]]]</sup><sup>[[#References|[4]]]</sup>, YOLO does not consist of a multi-stage pipeline that can become difficult to optimize. Instead, it frames object detection as a single regression problem and uses a single convolutional neural network to predict bounding boxes and class probabilities directly from full images.
Some of the cited benefits of the YOLO model are:
#''Extreme speed''. This is attributable to the framing of detection as a single regression problem, evaluated in one forward pass of the network.
#''Learning of generalizable object representations''. YOLO's average precision is experimentally found to degrade more slowly than that of comparable methods such as R-CNN when the model is tested on data whose pixel-level statistics differ significantly from natural training images, such as the Picasso and People-Art artwork datasets.
#''Global reasoning during prediction''. Since YOLO sees the whole image during training, unlike techniques that only ever see subsections of an image, its neural network implicitly encodes global contextual information about classes in addition to their local appearance. As a result, YOLO makes fewer misclassification errors on image backgrounds than methods that cannot exploit this global context.
=Presented by=
* Mitchell Snaith | msnaith@edu.uwaterloo.ca
=Encoding predictions=
The most important part of the YOLO model is its novel approach to prediction encoding. The input image is divided into an <math>S \times S</math> grid, and the grid cell containing an object's centre is responsible for detecting that object. Each grid cell predicts <math>B</math> bounding boxes, each described by 5 values (the box centre <math>(x, y)</math> relative to the cell, its width and height <math>(w, h)</math> relative to the image, and a confidence score), along with <math>C</math> conditional class probabilities. The predictions are therefore encoded as an <math>S \times S \times (B \cdot 5 + C)</math> tensor; with the paper's PASCAL VOC settings of <math>S = 7</math>, <math>B = 2</math>, and <math>C = 20</math>, this gives the <math>7 \times 7 \times 30</math> output produced by the final layer of the network below.
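To make the encoding concrete, the following is a minimal NumPy sketch of how one such <math>7 \times 7 \times 30</math> prediction tensor could be decoded into scored boxes. The per-cell value layout (the <math>B</math> boxes first, then the <math>C</math> class probabilities), the function name <code>decode_predictions</code>, and the threshold value are illustrative assumptions rather than the authors' Darknet implementation, and non-maximum suppression is omitted.

<syntaxhighlight lang="python">
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)

def decode_predictions(pred, conf_threshold=0.2):
    """Decode an S x S x (B*5 + C) YOLO output tensor into boxes.

    Assumed layout per cell: B boxes of (x, y, w, h, confidence),
    followed by C class probabilities shared by the cell's boxes.
    """
    boxes = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]  # C conditional class probabilities
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                # class-specific score = Pr(class | object) * box confidence
                scores = class_probs * conf
                cls = int(np.argmax(scores))
                if scores[cls] < conf_threshold:
                    continue
                # convert the cell-relative centre to image-relative coordinates
                cx, cy = (col + x) / S, (row + y) / S
                boxes.append((cx, cy, w, h, cls, float(scores[cls])))
    return boxes

# usage: a random tensor standing in for real network output
dummy = np.random.rand(S, S, B * 5 + C)
print(len(decode_predictions(dummy)), "boxes above threshold")
</syntaxhighlight>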
=References=
* <sup>[https://ieeexplore.ieee.org/document/5255236 [1]]</sup> P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', 32(9):1627–1645, 2010.
* <sup>[https://arxiv.org/abs/1311.2524 [2]]</sup> R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In ''Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on'', pages 580–587. IEEE, 2014.
* <sup>[https://arxiv.org/abs/1504.08083 [3]]</sup> R. B. Girshick. Fast R-CNN. ''CoRR'', abs/1504.08083, 2015.
* <sup>[https://arxiv.org/abs/1506.01497 [4]]</sup> S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. ''arXiv preprint arXiv:1506.01497'', 2015.
=Overview of network architecture=
The network consists of 24 convolutional layers followed by 2 fully connected layers; with a <math>448 \times 448 \times 3</math> input image, the layers and their output dimensions are:
{| class="wikitable"
! layer !! filters !! stride !! output dimension
|-
| Conv 1 || 7 x 7 x 64 || 2 || 224 x 224 x 64
|-
| Max Pool 1 || 2 x 2 || 2 || 112 x 112 x 64
|-
| Conv 2 || 3 x 3 x 192 || 1 || 112 x 112 x 192
|-
| Max Pool 2 || 2 x 2 || 2 || 56 x 56 x 192
|-
| Conv 3 || 1 x 1 x 128 || 1 || 56 x 56 x 128
|-
| Conv 4 || 3 x 3 x 256 || 1 || 56 x 56 x 256
|-
| Conv 5 || 1 x 1 x 256 || 1 || 56 x 56 x 256
|-
| Conv 6 || 3 x 3 x 512 || 1 || 56 x 56 x 512
|-
| Max Pool 3 || 2 x 2 || 2 || 28 x 28 x 512
|-
| Conv 7 || 1 x 1 x 256 || 1 || 28 x 28 x 256
|-
| Conv 8 || 3 x 3 x 512 || 1 || 28 x 28 x 512
|-
| Conv 9 || 1 x 1 x 256 || 1 || 28 x 28 x 256
|-
| Conv 10 || 3 x 3 x 512 || 1 || 28 x 28 x 512
|-
| Conv 11 || 1 x 1 x 256 || 1 || 28 x 28 x 256
|-
| Conv 12 || 3 x 3 x 512 || 1 || 28 x 28 x 512
|-
| Conv 13 || 1 x 1 x 256 || 1 || 28 x 28 x 256
|-
| Conv 14 || 3 x 3 x 512 || 1 || 28 x 28 x 512
|-
| Conv 15 || 1 x 1 x 512 || 1 || 28 x 28 x 512
|-
| Conv 16 || 3 x 3 x 1024 || 1 || 28 x 28 x 1024
|-
| Max Pool 4 || 2 x 2 || 2 || 14 x 14 x 1024
|-
| Conv 17 || 1 x 1 x 512 || 1 || 14 x 14 x 512
|-
| Conv 18 || 3 x 3 x 1024 || 1 || 14 x 14 x 1024
|-
| Conv 19 || 1 x 1 x 512 || 1 || 14 x 14 x 512
|-
| Conv 20 || 3 x 3 x 1024 || 1 || 14 x 14 x 1024
|-
| Conv 21 || 3 x 3 x 1024 || 1 || 14 x 14 x 1024
|-
| Conv 22 || 3 x 3 x 1024 || 2 || 7 x 7 x 1024
|-
| Conv 23 || 3 x 3 x 1024 || 1 || 7 x 7 x 1024
|-
| Conv 24 || 3 x 3 x 1024 || 1 || 7 x 7 x 1024
|-
| Fully Connected 1 || - || - || 4096
|-
| Fully Connected 2 || - || - || 7 x 7 x 30
|}
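The table maps directly onto a layer stack. Below is a compact PyTorch sketch that reproduces the listed output sizes from a <math>448 \times 448 \times 3</math> input; the class name <code>YoloV1</code>, the leaky ReLU activations, and the padding choices are assumptions made to match the table's dimensions, not the authors' Darknet configuration.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

def conv(in_ch, out_ch, k, s=1):
    # convolution + leaky ReLU; padding chosen to preserve spatial size at stride 1
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=s, padding=k // 2),
        nn.LeakyReLU(0.1),
    )

class YoloV1(nn.Module):
    """Sketch of the 24 convolutional + 2 fully connected layers tabulated above."""

    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        self.features = nn.Sequential(
            conv(3, 64, 7, s=2), nn.MaxPool2d(2, 2),             # Conv 1, Max Pool 1
            conv(64, 192, 3), nn.MaxPool2d(2, 2),                # Conv 2, Max Pool 2
            conv(192, 128, 1), conv(128, 256, 3),                # Conv 3-4
            conv(256, 256, 1), conv(256, 512, 3),                # Conv 5-6
            nn.MaxPool2d(2, 2),                                  # Max Pool 3
            *[m for _ in range(4)                                # Conv 7-14
              for m in (conv(512, 256, 1), conv(256, 512, 3))],
            conv(512, 512, 1), conv(512, 1024, 3),               # Conv 15-16
            nn.MaxPool2d(2, 2),                                  # Max Pool 4
            *[m for _ in range(2)                                # Conv 17-20
              for m in (conv(1024, 512, 1), conv(512, 1024, 3))],
            conv(1024, 1024, 3), conv(1024, 1024, 3, s=2),       # Conv 21-22
            conv(1024, 1024, 3), conv(1024, 1024, 3),            # Conv 23-24
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 1024, 4096), nn.LeakyReLU(0.1),    # Fully Connected 1
            nn.Linear(4096, S * S * (B * 5 + C)),                # Fully Connected 2
        )
        self.S, self.B, self.C = S, B, C

    def forward(self, x):
        out = self.head(self.features(x))
        return out.view(-1, self.S, self.S, self.B * 5 + self.C)

# usage: a 448 x 448 input yields the 7 x 7 x 30 prediction tensor
print(YoloV1()(torch.randn(1, 3, 448, 448)).shape)  # torch.Size([1, 7, 7, 30])
</syntaxhighlight>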