stat441F18/YOLO


The You Only Look Once (YOLO) object detection model is a single-stage network that aims to combine competitive accuracy with extremely fast inference. Unlike other popular approaches to object detection, such as the sliding-window DPM [1] or region proposal models such as the R-CNN variants [2][3][4], YOLO does not consist of a multi-step pipeline that can become difficult to optimize. Instead, it frames object detection as a single regression problem and uses a single convolutional neural network to predict bounding boxes and class probabilities for each box.

Some of the cited benefits of the YOLO model are:

  1. Extreme speed. This is attributable to the framing of detection as a single regression problem.
  2. Learning of generalizable object representations. YOLO's average precision is experimentally found to degrade more slowly than that of comparable methods such as R-CNN when the model is tested on data whose pixel-level appearance differs significantly from the training data, such as the Picasso and People-Art datasets.
  3. Global reasoning during prediction. Since YOLO sees the entire image during training, unlike techniques that only see a subsection of the image, its neural network implicitly encodes global contextual information about classes in addition to their local properties. As a result, YOLO makes fewer misclassification errors on image backgrounds than methods that cannot benefit from this global context.

Presented by

  • Mitchell Snaith | msnaith@edu.uwaterloo.ca

Encoding predictions

The most important part of the YOLO model is its novel approach to prediction encoding. The input image is divided into an [math]\displaystyle{ S \times S }[/math] grid. Each grid cell predicts [math]\displaystyle{ B }[/math] bounding boxes, each consisting of 4 coordinates (the box center relative to the cell, and its width and height relative to the image) and a confidence score, together with [math]\displaystyle{ C }[/math] conditional class probabilities shared by the cell. The network output is therefore an [math]\displaystyle{ S \times S \times (B \cdot 5 + C) }[/math] tensor. For PASCAL VOC, [math]\displaystyle{ S = 7 }[/math], [math]\displaystyle{ B = 2 }[/math] and [math]\displaystyle{ C = 20 }[/math], giving the [math]\displaystyle{ 7 \times 7 \times 30 }[/math] output of the final layer in the architecture table below.
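
To make the encoding concrete, below is a minimal NumPy sketch (an illustration, not the authors' code) of how a 7 x 7 x 30 output tensor could be decoded into detections. The per-cell layout of B boxes of (x, y, w, h, confidence) followed by C class probabilities follows the paper; the function name and confidence threshold are illustrative, and non-maximal suppression is omitted.

import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)

def decode_predictions(pred, conf_threshold=0.2):
    """Decode a YOLO output tensor of shape (S, S, B*5 + C) into boxes."""
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]  # Pr(class | object), shared by the cell
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                # Class-specific confidence = Pr(class | object) * Pr(object) * IOU
                scores = class_probs * conf
                cls = int(np.argmax(scores))
                if scores[cls] < conf_threshold:
                    continue
                # (x, y) are offsets within the cell; convert the box center to
                # image-relative coordinates. (w, h) are already image-relative.
                cx = (col + x) / S
                cy = (row + y) / S
                detections.append((cx, cy, w, h, float(scores[cls]), cls))
    return detections

# Example: decode a random tensor shaped like the final 7 x 7 x 30 layer output
dets = decode_predictions(np.random.rand(S, S, B * 5 + C))

In practice, the surviving boxes are then filtered with non-maximal suppression to remove duplicate detections of the same object.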

References

  • [1] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
  • [2] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE, 2014.
  • [3] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
  • [4] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015.


Overview of network architecture

The network takes a 448 x 448 x 3 input image and passes it through 24 convolutional layers followed by 2 fully connected layers:

Layer               Filters        Stride   Output dimension
Conv 1              7 x 7 x 64     2        224 x 224 x 64
Max Pool 1          2 x 2          2        112 x 112 x 64
Conv 2              3 x 3 x 192    1        112 x 112 x 192
Max Pool 2          2 x 2          2        56 x 56 x 192
Conv 3              1 x 1 x 128    1        56 x 56 x 128
Conv 4              3 x 3 x 256    1        56 x 56 x 256
Conv 5              1 x 1 x 256    1        56 x 56 x 256
Conv 6              3 x 3 x 512    1        56 x 56 x 512
Max Pool 3          2 x 2          2        28 x 28 x 512
Conv 7              1 x 1 x 256    1        28 x 28 x 256
Conv 8              3 x 3 x 512    1        28 x 28 x 512
Conv 9              1 x 1 x 256    1        28 x 28 x 256
Conv 10             3 x 3 x 512    1        28 x 28 x 512
Conv 11             1 x 1 x 256    1        28 x 28 x 256
Conv 12             3 x 3 x 512    1        28 x 28 x 512
Conv 13             1 x 1 x 256    1        28 x 28 x 256
Conv 14             3 x 3 x 512    1        28 x 28 x 512
Conv 15             1 x 1 x 512    1        28 x 28 x 512
Conv 16             3 x 3 x 1024   1        28 x 28 x 1024
Max Pool 4          2 x 2          2        14 x 14 x 1024
Conv 17             1 x 1 x 512    1        14 x 14 x 512
Conv 18             3 x 3 x 1024   1        14 x 14 x 1024
Conv 19             1 x 1 x 512    1        14 x 14 x 512
Conv 20             3 x 3 x 1024   1        14 x 14 x 1024
Conv 21             3 x 3 x 1024   1        14 x 14 x 1024
Conv 22             3 x 3 x 1024   2        7 x 7 x 1024
Conv 23             3 x 3 x 1024   1        7 x 7 x 1024
Conv 24             3 x 3 x 1024   1        7 x 7 x 1024
Fully Connected 1   -              -        4096
Fully Connected 2   -              -        7 x 7 x 30
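
As a rough illustration, the architecture above could be expressed as a PyTorch module along the following lines. This is a minimal sketch rather than the authors' implementation (the original model was written in Darknet); the configuration list mirrors the table, the LeakyReLU slope of 0.1 and the dropout after the first fully connected layer follow the paper, and all names are illustrative.

import torch
import torch.nn as nn

# (kernel size, output channels, stride) per conv layer; 'M' = 2 x 2 max pool, stride 2
CFG = [
    (7, 64, 2), 'M',
    (3, 192, 1), 'M',
    (1, 128, 1), (3, 256, 1), (1, 256, 1), (3, 512, 1), 'M',
    *([(1, 256, 1), (3, 512, 1)] * 4), (1, 512, 1), (3, 1024, 1), 'M',
    *([(1, 512, 1), (3, 1024, 1)] * 2), (3, 1024, 1), (3, 1024, 2),
    (3, 1024, 1), (3, 1024, 1),
]

class YOLOv1(nn.Module):
    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        self.S, self.B, self.C = S, B, C
        layers, in_ch = [], 3
        for item in CFG:
            if item == 'M':
                layers.append(nn.MaxPool2d(2, 2))
            else:
                k, out_ch, stride = item
                layers.append(nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2))
                layers.append(nn.LeakyReLU(0.1))
                in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 1024, 4096),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.5),
            nn.Linear(4096, S * S * (B * 5 + C)),
        )

    def forward(self, x):
        out = self.head(self.features(x))
        # Reshape the flat output into the S x S x (B*5 + C) prediction grid
        return out.view(-1, self.S, self.S, self.B * 5 + self.C)

# A 448 x 448 RGB input produces the 7 x 7 x 30 prediction tensor
y = YOLOv1()(torch.randn(1, 3, 448, 448))  # y.shape == (1, 7, 7, 30)

Note how the final reshaped output matches the prediction encoding described above: one vector of B * 5 + C = 30 values per grid cell.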