stat441F18/YOLO
The You Only Look Once (YOLO) object detection model is a one-shot object detection network that aims to combine suitable accuracy with extreme speed. Unlike other popular approaches to object detection, such as sliding-window DPM [1] or region proposal models such as the R-CNN variants [2][3][4], YOLO does not consist of a multi-step pipeline, which can be difficult to optimize. Instead, it frames object detection as a single regression problem and uses a single convolutional neural network to predict bounding boxes and class probabilities for each box.
Some of the cited benefits of the YOLO model are:
- Extreme speed. Because detection is framed as a single regression problem, the base model runs at 45 frames per second, and a smaller variant (Fast YOLO) exceeds 150 frames per second.
- Learning of generalizable object representations. When trained on natural images and tested on artwork, such as the Picasso and People-Art datasets, YOLO's average precision degrades at a slower rate than that of comparable methods such as R-CNN.
- Global reasoning during prediction. Since YOLO sees entire images during training, unlike techniques that only see a subsection of the image, its neural network implicitly encodes global contextual information about classes in addition to their local appearance. As a result, YOLO makes fewer misclassification errors on image backgrounds than methods that cannot benefit from this global context.
Presented by
- Mitchell Snaith | msnaith@edu.uwaterloo.ca
Encoding predictions
The most important part of the YOLO model is its novel approach to prediction encoding. The input image is divided into an [math]\displaystyle{ S \times S }[/math] grid. Each grid cell predicts [math]\displaystyle{ B }[/math] bounding boxes, each consisting of four coordinates and a confidence score, together with [math]\displaystyle{ C }[/math] conditional class probabilities, so the predictions are encoded as an [math]\displaystyle{ S \times S \times (5B + C) }[/math] tensor. With [math]\displaystyle{ S = 7 }[/math], [math]\displaystyle{ B = 2 }[/math], and [math]\displaystyle{ C = 20 }[/math] (the PASCAL VOC classes), this yields the [math]\displaystyle{ 7 \times 7 \times 30 }[/math] output of the network's final layer in the architecture table below.
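To make this layout concrete, below is a minimal NumPy sketch of decoding such a tensor into detections. The per-cell layout (B boxes of (x, y, w, h, confidence) followed by C class probabilities) follows the paper, but the function name, threshold, and random input are illustrative assumptions of this summary.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)

def decode(pred, conf_threshold=0.25):
    """Turn an (S, S, 5B + C) prediction tensor into a list of detections."""
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[5 * B:]           # Pr(class | object) for this cell
            for b in range(B):
                x, y, w, h, conf = cell[5 * b: 5 * b + 5]
                # x, y are offsets within the cell; w, h are relative to the
                # whole image, as in the original paper.
                cx = (col + x) / S               # box centre, image-relative
                cy = (row + y) / S
                scores = conf * class_probs      # class-specific confidence
                cls = int(np.argmax(scores))
                if scores[cls] >= conf_threshold:
                    detections.append((cx, cy, w, h, cls, scores[cls]))
    return detections

# Exercise the function on a random tensor (a real model would also apply
# non-maximum suppression to the resulting detections).
detections = decode(np.random.rand(S, S, 5 * B + C))
```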
References
- [1] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
- [2] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE, 2014.
- [3] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
- [4] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015.
Overview of network architecture
The network consists of 24 convolutional layers followed by 2 fully connected layers, and operates on 448 x 448 x 3 input images:
layer | filters | stride | output dimension |
---|---|---|---|
Conv 1 | 7 x 7 x 64 | 2 | 224 x 224 x 64 |
Max Pool 1 | 2 x 2 | 2 | 112 x 112 x 64 |
Conv 2 | 3 x 3 x 192 | 1 | 112 x 112 x 192 |
Max Pool 2 | 2 x 2 | 2 | 56 x 56 x 192 |
Conv 3 | 1 x 1 x 128 | 1 | 56 x 56 x 128 |
Conv 4 | 3 x 3 x 256 | 1 | 56 x 56 x 256 |
Conv 5 | 1 x 1 x 256 | 1 | 56 x 56 x 256 |
Conv 6 | 3 x 3 x 512 | 1 | 56 x 56 x 512 |
Max Pool 3 | 2 x 2 | 2 | 28 x 28 x 512 |
Conv 7 | 1 x 1 x 256 | 1 | 28 x 28 x 256 |
Conv 8 | 3 x 3 x 512 | 1 | 28 x 28 x 512 |
Conv 9 | 1 x 1 x 256 | 1 | 28 x 28 x 256 |
Conv 10 | 3 x 3 x 512 | 1 | 28 x 28 x 512 |
Conv 11 | 1 x 1 x 256 | 1 | 28 x 28 x 256 |
Conv 12 | 3 x 3 x 512 | 1 | 28 x 28 x 512 |
Conv 13 | 1 x 1 x 256 | 1 | 28 x 28 x 256 |
Conv 14 | 3 x 3 x 512 | 1 | 28 x 28 x 512 |
Conv 15 | 1 x 1 x 512 | 1 | 28 x 28 x 512 |
Conv 16 | 3 x 3 x 1024 | 1 | 28 x 28 x 1024 |
Max Pool 4 | 2 x 2 | 2 | 14 x 14 x 1024 |
Conv 17 | 1 x 1 x 512 | 1 | 14 x 14 x 512 |
Conv 18 | 3 x 3 x 1024 | 1 | 14 x 14 x 1024 |
Conv 19 | 1 x 1 x 512 | 1 | 14 x 14 x 512 |
Conv 20 | 3 x 3 x 1024 | 1 | 14 x 14 x 1024 |
Conv 21 | 3 x 3 x 1024 | 1 | 14 x 14 x 1024 |
Conv 22 | 3 x 3 x 1024 | 2 | 7 x 7 x 1024 |
Conv 23 | 3 x 3 x 1024 | 1 | 7 x 7 x 1024 |
Conv 24 | 3 x 3 x 1024 | 1 | 7 x 7 x 1024 |
Fully Connected 1 | - | - | 4096 |
Fully Connected 2 | - | - | 7 x 7 x 30 |
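As a sanity check on the table, here is a minimal PyTorch sketch of the same stack. This is an illustration by this summary, not the authors' Darknet implementation; the padding choices are assumed so as to reproduce the listed output sizes, and the paper's leaky ReLU activations and dropout after the first fully connected layer are included.

```python
import torch
import torch.nn as nn

def conv(in_ch, out_ch, k, stride=1):
    # 'same'-style padding (k // 2) preserves the spatial sizes in the table.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2),
        nn.LeakyReLU(0.1),
    )

yolo = nn.Sequential(
    conv(3, 64, 7, stride=2), nn.MaxPool2d(2, 2),                # Conv 1, Pool 1
    conv(64, 192, 3), nn.MaxPool2d(2, 2),                        # Conv 2, Pool 2
    conv(192, 128, 1), conv(128, 256, 3),                        # Conv 3-4
    conv(256, 256, 1), conv(256, 512, 3), nn.MaxPool2d(2, 2),    # Conv 5-6, Pool 3
    *[m for _ in range(4)                                        # Conv 7-14
      for m in (conv(512, 256, 1), conv(256, 512, 3))],
    conv(512, 512, 1), conv(512, 1024, 3), nn.MaxPool2d(2, 2),   # Conv 15-16, Pool 4
    *[m for _ in range(2)                                        # Conv 17-20
      for m in (conv(1024, 512, 1), conv(512, 1024, 3))],
    conv(1024, 1024, 3), conv(1024, 1024, 3, stride=2),          # Conv 21-22
    conv(1024, 1024, 3), conv(1024, 1024, 3),                    # Conv 23-24
    nn.Flatten(),
    nn.Linear(7 * 7 * 1024, 4096), nn.LeakyReLU(0.1),            # Fully Connected 1
    nn.Dropout(0.5),                                             # as in the paper
    nn.Linear(4096, 7 * 7 * 30),                                 # Fully Connected 2
)

out = yolo(torch.randn(1, 3, 448, 448))
assert out.shape == (1, 7 * 7 * 30)
```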