The You Only Look Once (YOLO) object detection model is a one-shot object detection network that aims to combine suitable accuracy with extreme performance. Unlike other popular approaches to object detection such as sliding-window DPM ^[1] or region proposal models such as the R-CNN variants ^[2]^[3]^[4], YOLO does not consist of a multi-step pipeline, which can be difficult to optimize. Instead, it frames object detection as a single regression problem and uses a single convolutional neural network to predict bounding boxes and class probabilities directly from the full image.

Some of the cited benefits of the YOLO model are:

Extreme speed. This is attributable to the framing of detection as a single regression problem.
Learning of generalizable object representations. YOLO is experimentally found to degrade in average precision at a slower rate than comparable methods such as R-CNN when applied to datasets whose pixel-level statistics differ significantly from the training data, such as the artwork in the Picasso and People-Art datasets.
Global reasoning during prediction. Since YOLO sees whole images during training and test time, unlike other techniques which only see a subsection of the image, its neural network implicitly encodes global contextual information about classes in addition to their local properties. As a result, YOLO makes fewer misclassification errors on image backgrounds than other methods which cannot benefit from this global context.

Presented by

Mitchell Snaith | msnaith@edu.uwaterloo.ca

Encoding predictions

The most important part of the YOLO model is its novel approach to prediction encoding.

The input image is first divided into an [math]\displaystyle{ S \times S }[/math] grid of cells. For each ground-truth bounding box of an object in the input image, the YOLO model takes the center point of the box and assigns it to the grid cell that contains it. That cell is responsible for detecting the object.
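The cell assignment can be sketched in a few lines of Python. This is only an illustration: the function name `responsible_cell`, the pixel-coordinate inputs and the default grid size of 7 are assumptions made for the example, not something specified above.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell containing the box center (cx, cy).

    cx, cy are in pixels; the image is img_w x img_h and the grid is S x S.
    All names and the default S are hypothetical, chosen for illustration.
    """
    # Scale the center into grid units, then truncate to get the cell index.
    col = int(cx / img_w * S)
    row = int(cy / img_h * S)
    # A center exactly on the right or bottom edge would index cell S; clamp it.
    return min(row, S - 1), min(col, S - 1)


if __name__ == "__main__":
    # A box centered in the middle of a 448 x 448 image lands in the middle cell.
    print(responsible_cell(224, 224, 448, 448))
```

Clamping is a small design choice that keeps edge-aligned centers inside the grid; an implementation could equally reject such boxes.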

For each grid cell, YOLO predicts [math]\displaystyle{ B }[/math] bounding boxes along with confidence scores. These bounding boxes are predicted directly by the fully connected layers at the end of the single neural network, as seen later in the network architecture. Each bounding box consists of 5 predictions:

[math]\displaystyle{ \begin{align*} &\left. \begin{aligned} x \\ y \end{aligned} \right\rbrace \text{the center of the bounding box relative to the grid cell} \\ &\left. \begin{aligned} w \\ h \end{aligned} \right\rbrace \text{the width and height of the bounding box relative to the whole input} \\ &\left. \begin{aligned} p_c \end{aligned} \right\rbrace \text{the confidence of presence of an object of any class} \end{align*} }[/math]

[math]\displaystyle{ (x, y) }[/math] and [math]\displaystyle{ (w, h) }[/math] are normalized to the range [math]\displaystyle{ (0, 1) }[/math]. Further, [math]\displaystyle{ p_c }[/math] in this context is defined as follows:

[math]\displaystyle{ p_c = P(\text{object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} }[/math]

Here IOU is the intersection over union, also called the Jaccard index^[5]. It is an evaluation metric that rewards bounding boxes which significantly overlap with the ground-truth bounding boxes of labelled objects in the input.
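As a minimal sketch of how IOU is computed, assume each box is given by its corner coordinates `(x_min, y_min, x_max, y_max)`. That corner format and the function name `iou` are assumptions for this example; YOLO itself predicts boxes as center plus width and height, so a conversion would be needed first.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this corner format is an
    assumption of the sketch, not the encoding YOLO predicts.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap region, zero if the boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    # Union = sum of the two areas minus the part counted twice.
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2 x 2 boxes overlapping in a 1 x 1 square: 1 / (4 + 4 - 1).
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A perfect prediction gives an IOU of 1 and a prediction that misses the object entirely gives 0, so the confidence target above shrinks as the predicted box drifts away from the ground truth.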

Each grid cell must also predict [math]\displaystyle{ C }[/math] class probabilities [math]\displaystyle{ P(C_i | \text{object}) }[/math]. The set of class probabilities is only predicted once for each grid cell, irrespective of the number of boxes [math]\displaystyle{ B }[/math]. Thus combining the grid cell division with the bounding box and class probability predictions, we end up with a tensor output of the shape [math]\displaystyle{ S \times S \times (B \cdot 5 + C) }[/math].
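To make the output shape concrete, the sketch below computes it and splits one cell's vector into its boxes and class probabilities. The values [math]\displaystyle{ S = 7 }[/math], [math]\displaystyle{ B = 2 }[/math] and [math]\displaystyle{ C = 20 }[/math] are the ones the paper uses for PASCAL VOC. The ordering inside a cell's vector (all boxes first, then classes) and the helper names are assumptions made for illustration only.

```python
def output_shape(S, B, C):
    """Shape of the prediction tensor: S x S cells, each with B*5 + C values."""
    return (S, S, B * 5 + C)


def split_cell(cell, B, C):
    """Split one cell's flat vector into B boxes and C class probabilities.

    Assumed layout (hypothetical): [x, y, w, h, p_c] repeated B times,
    followed by the C class probabilities.
    """
    boxes = [tuple(cell[5 * b:5 * b + 5]) for b in range(B)]
    classes = list(cell[5 * B:5 * B + C])
    return boxes, classes


if __name__ == "__main__":
    print(output_shape(7, 2, 20))
    boxes, classes = split_cell(list(range(30)), B=2, C=20)
    print(len(boxes), len(classes))
```

With these values each cell carries 2 · 5 + 20 = 30 numbers, which agrees with the 7 x 7 x 30 output of the final layer in the architecture table.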

Neural network architecture

The network is structured quite conventionally, with convolutional and max pooling layers to perform feature extraction, along with some convolutional layers and 2 fully connected layers at the end which predict the bounding boxes along with class probabilities.


Layer	Filters	Stride	Output dimension
Conv 1	7 x 7 x 64	2	224 x 224 x 64
Max Pool 1	2 x 2	2	112 x 112 x 64
Conv 2	3 x 3 x 192	1	112 x 112 x 192
Max Pool 2	2 x 2	2	56 x 56 x 192
Conv 3	1 x 1 x 128	1	56 x 56 x 128
Conv 4	3 x 3 x 256	1	56 x 56 x 256
Conv 5	1 x 1 x 256	1	56 x 56 x 256
Conv 6	3 x 3 x 512	1	56 x 56 x 512
Max Pool 3	2 x 2	2	28 x 28 x 512
Conv 7	1 x 1 x 256	1	28 x 28 x 256
Conv 8	3 x 3 x 512	1	28 x 28 x 512
Conv 9	1 x 1 x 256	1	28 x 28 x 256
Conv 10	3 x 3 x 512	1	28 x 28 x 512
Conv 11	1 x 1 x 256	1	28 x 28 x 256
Conv 12	3 x 3 x 512	1	28 x 28 x 512
Conv 13	1 x 1 x 256	1	28 x 28 x 256
Conv 14	3 x 3 x 512	1	28 x 28 x 512
Conv 15	1 x 1 x 512	1	28 x 28 x 512
Conv 16	3 x 3 x 1024	1	28 x 28 x 1024
Max Pool 4	2 x 2	2	14 x 14 x 1024
Conv 17	1 x 1 x 512	1	14 x 14 x 512
Conv 18	3 x 3 x 1024	1	14 x 14 x 1024
Conv 19	1 x 1 x 512	1	14 x 14 x 512
Conv 20	3 x 3 x 1024	1	14 x 14 x 1024
Conv 21	3 x 3 x 1024	1	14 x 14 x 1024
Conv 22	3 x 3 x 1024	2	7 x 7 x 1024
Conv 23	3 x 3 x 1024	1	7 x 7 x 1024
Conv 24	3 x 3 x 1024	1	7 x 7 x 1024
Fully Connected 1	-	-	4096
Fully Connected 2	-	-	7 x 7 x 30
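The spatial sizes in the table can be checked with a short calculation. The sketch assumes a 448 x 448 input (implied by the 224 x 224 output of the stride-2 first layer) and padding chosen so that a stride-1 layer preserves resolution and a stride-2 layer halves it. The function name is hypothetical.

```python
from math import ceil


def spatial_sizes(input_size, strides):
    """Return the spatial size after each downsampling layer.

    Assumes padding such that output = ceil(input / stride), so stride-1
    layers can be left out because they do not change the resolution.
    """
    sizes = []
    size = input_size
    for stride in strides:
        size = ceil(size / stride)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    # Stride-2 layers in order: Conv 1, Max Pool 1-4, Conv 22.
    print(spatial_sizes(448, [2, 2, 2, 2, 2, 2]))
```

Six halvings take 448 down to 7, which is where the [math]\displaystyle{ S = 7 }[/math] grid of the final 7 x 7 x 30 output comes from.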

Network training details

Activation functions

A linear activation function is used for the final layer, which predicts class probabilities and bounding boxes, while a leaky ReLU, defined as follows, is used for all other layers:

[math]\displaystyle{ \phi(x) = x \cdot \mathbf{1}_{\{x \gt 0\}} + 0.1x \cdot \mathbf{1}_{\{x \leq 0\}} }[/math]
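Written as code, the activation is a one-line function. The sketch below is a scalar version for illustration; the function name and the `slope` parameter are assumptions, with the default taken from the 0.1 in the formula.

```python
def leaky_relu(x, slope=0.1):
    """Leaky ReLU: identity for positive inputs, slope * x otherwise."""
    return x if x > 0 else slope * x


if __name__ == "__main__":
    for value in (-2.0, 0.0, 3.0):
        print(value, leaky_relu(value))
```

Unlike a plain ReLU, negative inputs still produce a small non-zero gradient, so units are less likely to stop updating during training.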

Layers for feature extraction and detection

The authors pretrain the first 20 convolutional layers, followed by an average-pooling layer and a fully connected layer, on a large dataset, which in the paper's case was ImageNet's 1000-class competition dataset ^[6]. Afterwards, 4 convolutional layers and 2 fully connected layers with randomly initialized weights are added to perform detection, an approach shown to be beneficial by Ren et al. ^[7].

Loss function

The full loss function that YOLO optimizes is defined as follows:

[math]\displaystyle{ \begin{align*} L &= \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\ & \ \ \ \ \ + \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\ & \ \ \ \ \ + \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2 \\ & \ \ \ \ \ + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\ & \ \ \ \ \ + \sum_{i=0}^{S^2} \mathbf{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2, \end{align*} }[/math]

where:

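[math]\displaystyle{ \mathbf{1}_{ij}^{\text{obj}} }[/math] is an indicator for the [math]\displaystyle{ j }[/math]-th bounding box in the [math]\displaystyle{ i }[/math]-th grid cell being responsible for the given prediction
[math]\displaystyle{ \mathbf{1}_{i}^{\text{obj}} }[/math] is an indicator for an object appearing in the [math]\displaystyle{ i }[/math]-th grid cell
[math]\displaystyle{ C_i }[/math] is the box confidence score (written [math]\displaystyle{ p_c }[/math] above), not the number of classes, and [math]\displaystyle{ p_i(c) }[/math] is the probability of class [math]\displaystyle{ c }[/math] in the [math]\displaystyle{ i }[/math]-th grid cell
[math]\displaystyle{ \lambda_{\text{coord}} }[/math] is a hyperparameter designed to increase the loss stemming from bounding box coordinate predictions, set in the paper to be [math]\displaystyle{ 5 }[/math]
[math]\displaystyle{ \lambda_{\text{noobj}} }[/math] is a hyperparameter designed to decrease the loss stemming from confidence predictions for boxes that do not contain objects, set in the paper to be [math]\displaystyle{ 0.5 }[/math]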

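This loss function is evidently complex, so it is easier to understand in parts:

The first term penalizes errors in the predicted box center [math]\displaystyle{ (x, y) }[/math] for the boxes responsible for an object.
The second term penalizes errors in the predicted width and height. Taking square roots means that a given absolute error costs more for a small box than for a large one.
The third term penalizes confidence errors for the boxes responsible for an object.
The fourth term penalizes confidence errors for boxes that are not responsible for any object, down-weighted by [math]\displaystyle{ \lambda_{\text{noobj}} }[/math] because most boxes contain no object.
The fifth term penalizes class probability errors, but only in grid cells that contain an object.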
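The five terms of the loss can also be read off from a direct, unvectorized sketch. Everything about the data layout here is an assumption made for illustration: the dictionary keys, the `resp` index of the responsible box, passing the target confidence in explicitly, treating every non-responsible box as a "noobj" box with target confidence 0, and clamping negative widths and heights before the square root.

```python
from math import sqrt

LAMBDA_COORD = 5.0  # weight on coordinate terms, value set in the paper
LAMBDA_NOOBJ = 0.5  # weight on no-object confidence terms, value set in the paper


def yolo_loss(preds, targets):
    """Sum-of-squares YOLO loss over all grid cells (hypothetical layout).

    preds:   one entry per cell, {"boxes": [(x, y, w, h, conf), ...],
                                  "classes": [p_1, ..., p_C]}
    targets: one entry per cell, None if the cell has no object, else
             {"box": (x, y, w, h), "conf": float, "classes": [...],
              "resp": index of the responsible predicted box}
    """
    loss = 0.0
    for pred, tgt in zip(preds, targets):
        for j, (x, y, w, h, conf) in enumerate(pred["boxes"]):
            if tgt is not None and j == tgt["resp"]:
                tx, ty, tw, th = tgt["box"]
                # Term 1: squared error of the box center.
                loss += LAMBDA_COORD * ((tx - x) ** 2 + (ty - y) ** 2)
                # Term 2: squared error of square-rooted width and height.
                # Negative predictions are clamped to 0 (an assumption).
                loss += LAMBDA_COORD * (
                    (sqrt(tw) - sqrt(max(w, 0.0))) ** 2
                    + (sqrt(th) - sqrt(max(h, 0.0))) ** 2
                )
                # Term 3: confidence error for the responsible box.
                loss += (tgt["conf"] - conf) ** 2
            else:
                # Term 4: non-responsible boxes are pushed toward confidence 0.
                loss += LAMBDA_NOOBJ * (0.0 - conf) ** 2
        if tgt is not None:
            # Term 5: class probability error, once per cell with an object.
            loss += sum(
                (t - p) ** 2 for t, p in zip(tgt["classes"], pred["classes"])
            )
    return loss


if __name__ == "__main__":
    pred = {"boxes": [(0.5, 0.5, 0.25, 0.25, 0.8), (0.1, 0.1, 0.04, 0.04, 0.2)],
            "classes": [0.9, 0.1]}
    tgt = {"box": (0.5, 0.5, 0.25, 0.25), "conf": 1.0,
           "classes": [1.0, 0.0], "resp": 0}
    print(yolo_loss([pred], [tgt]))
```

In the usage example the responsible box matches the ground truth exactly, so only the confidence and class terms contribute. A practical implementation would vectorize this over the whole [math]\displaystyle{ S \times S \times (B \cdot 5 + C) }[/math] tensor rather than loop over cells.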

References

^[1]P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.

^[2]R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE, 2014.

^[3]R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.

^[4]S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015.

^[5]P. Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37: 547–579, 1901.

^[6]O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.

^[7]S. Ren, K. He, R. B. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. CoRR, abs/1504.06066, 2015.

stat441F18/YOLO
