stat441F18/YOLO

Revision as of 23:03, 19 November 2018

The You Only Look Once (YOLO) object detection model is a one-shot object detection network that aims to combine suitable accuracy with very fast inference. Unlike other popular approaches to object detection, such as sliding-window DPM [1] or region proposal models such as the R-CNN variants [2][3][4], YOLO does not consist of a multi-step pipeline, which can become difficult to optimize. Instead, it frames object detection as a single regression problem and uses a single convolutional neural network to predict bounding boxes and class probabilities for each box.

Some of the cited benefits of the YOLO model are:

  1. Extreme speed. This is attributable to the framing of detection as a single regression problem.
  2. Learning of generalizable object representations. YOLO is experimentally found to degrade in average precision at a slower rate than comparable methods such as R-CNN when exposed to datasets with significantly varying pixel-level test data, such as the Picasso and People-Art datasets.
  3. Global reasoning during prediction. Since YOLO sees whole images during training, unlike other techniques which only see a subsection of the image, its neural network implicitly encodes global contextual information about classes in addition to their local properties. As a result, YOLO makes fewer misclassification errors on image backgrounds than other methods which cannot benefit from this global context.

Presented by

  • Mitchell Snaith | msnaith@edu.uwaterloo.ca

Encoding predictions

The most important part of the YOLO model is its novel approach to prediction encoding.

The input image is first divided into an [math]\displaystyle{ S \times S }[/math] grid of cells. For each ground-truth bounding box in the input image, the YOLO model takes the central point of the box and links it to the grid cell in which that point is contained. This cell will be responsible for the detection of that object.

For each grid cell, YOLO predicts [math]\displaystyle{ B }[/math] bounding boxes along with confidence scores. These bounding boxes are predicted directly with fully connected layers at the end of the single neural network, as seen later in the network architecture. Each bounding box comprises 5 predictions:

[math]\displaystyle{ \begin{align*} &\left. \begin{aligned} x \\ y \end{aligned} \right\rbrace \text{the center of the bounding box relative to the grid cell} \\ &\left. \begin{aligned} w \\ h \end{aligned} \right\rbrace \text{the width and height of the bounding box relative to the whole input} \\ &\left. \begin{aligned} p_c \end{aligned} \right\rbrace \text{the confidence of presence of an object of any class} \end{align*} }[/math]


[math]\displaystyle{ (x, y) }[/math] and [math]\displaystyle{ (w, h) }[/math] are normalized to the range [math]\displaystyle{ (0, 1) }[/math]. Further, [math]\displaystyle{ p_c }[/math] in this context is defined as follows:

[math]\displaystyle{ p_c = P(\text{object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} }[/math]

Here IOU is the intersection over union, also called the Jaccard index[5]. It is an evaluation metric that rewards bounding boxes which significantly overlap with the ground-truth bounding boxes of labelled objects in the input.
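
As a minimal sketch of the IOU computation described above, the following function assumes boxes are given as corner coordinates (x_min, y_min, x_max, y_max). That corner representation and the function name iou are choices made for this illustration, not something specified in the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this corner format is an
    assumption of the sketch, chosen to keep the overlap arithmetic simple.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h

    # Union = sum of the two areas minus the doubly counted overlap.
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap area 1, union area 7
```

A box compared with itself gives an IOU of 1, while two disjoint boxes give 0, so a higher value indicates a better match with the ground truth.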

Each grid cell must also predict [math]\displaystyle{ C }[/math] class probabilities [math]\displaystyle{ P(C_i | \text{object}) }[/math]. The set of class probabilities is only predicted once for each grid cell, irrespective of the number of boxes [math]\displaystyle{ B }[/math]. Thus combining the grid cell division with the bounding box and class probability predictions, we end up with a tensor output of the shape [math]\displaystyle{ S \times S \times (B \cdot 5 + C) }[/math].
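
The encoding can be illustrated with a short sketch. It assumes the ground-truth box is given by its center and size as fractions of the image width and height, and it uses S = 7, B = 2 and C = 20 as defaults, which is consistent with the 7 x 7 x 30 output of the final layer. The function names encode_box and output_shape are hypothetical helpers for this example only.

```python
def encode_box(cx, cy, w, h, S=7):
    """Map a ground-truth box to its responsible grid cell.

    cx, cy, w, h are fractions of the image size in [0, 1] (an assumption
    of this sketch). Returns the cell's (row, col) and the box encoded as
    (x, y, w, h), where x and y are offsets relative to that cell.
    """
    # The cell containing the box center is responsible for the object.
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)

    # Center offsets within the cell; w and h stay relative to the image.
    x = cx * S - col
    y = cy * S - row
    return row, col, (x, y, w, h)


def output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor: S x S x (B * 5 + C)."""
    return (S, S, B * 5 + C)


print(encode_box(0.5, 0.5, 0.2, 0.4))
print(output_shape())
```

A box centered in the middle of the image lands in cell (3, 3) of a 7 x 7 grid with offsets (0.5, 0.5), while its width and height are left unchanged.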

Neural network architecture

The network is structured quite conventionally, with convolutional and max pooling layers to perform feature extraction, along with some convolutional layers and 2 fully connected layers at the end which predict the bounding boxes along with class probabilities.


Layer             | Filters        | Stride | Output dimension
Conv 1            | 7 x 7 x 64     | 2      | 224 x 224 x 64
Max Pool 1        | 2 x 2          | 2      | 112 x 112 x 64
Conv 2            | 3 x 3 x 192    | 1      | 112 x 112 x 192
Max Pool 2        | 2 x 2          | 2      | 56 x 56 x 192
Conv 3            | 1 x 1 x 128    | 1      | 56 x 56 x 128
Conv 4            | 3 x 3 x 256    | 1      | 56 x 56 x 256
Conv 5            | 1 x 1 x 256    | 1      | 56 x 56 x 256
Conv 6            | 3 x 3 x 512    | 1      | 56 x 56 x 512
Max Pool 3        | 2 x 2          | 2      | 28 x 28 x 512
Conv 7            | 1 x 1 x 256    | 1      | 28 x 28 x 256
Conv 8            | 3 x 3 x 512    | 1      | 28 x 28 x 512
Conv 9            | 1 x 1 x 256    | 1      | 28 x 28 x 256
Conv 10           | 3 x 3 x 512    | 1      | 28 x 28 x 512
Conv 11           | 1 x 1 x 256    | 1      | 28 x 28 x 256
Conv 12           | 3 x 3 x 512    | 1      | 28 x 28 x 512
Conv 13           | 1 x 1 x 256    | 1      | 28 x 28 x 256
Conv 14           | 3 x 3 x 512    | 1      | 28 x 28 x 512
Conv 15           | 1 x 1 x 512    | 1      | 28 x 28 x 512
Conv 16           | 3 x 3 x 1024   | 1      | 28 x 28 x 1024
Max Pool 4        | 2 x 2          | 2      | 14 x 14 x 1024
Conv 17           | 1 x 1 x 512    | 1      | 14 x 14 x 512
Conv 18           | 3 x 3 x 1024   | 1      | 14 x 14 x 1024
Conv 19           | 1 x 1 x 512    | 1      | 14 x 14 x 512
Conv 20           | 3 x 3 x 1024   | 1      | 14 x 14 x 1024
Conv 21           | 3 x 3 x 1024   | 1      | 14 x 14 x 1024
Conv 22           | 3 x 3 x 1024   | 2      | 7 x 7 x 1024
Conv 23           | 3 x 3 x 1024   | 1      | 7 x 7 x 1024
Conv 24           | 3 x 3 x 1024   | 1      | 7 x 7 x 1024
Fully Connected 1 | -              | -      | 4096
Fully Connected 2 | -              | -      | 7 x 7 x 30
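
The output dimensions of the convolutional part can be checked with a small sketch that propagates the spatial size through the layers. It assumes a 448 x 448 x 3 input, padding of kernel // 2 for convolutions and no padding for pooling; none of these are stated in the table, and the layer list below is transcribed by hand from the original paper's architecture, so treat it as an assumption of this example.

```python
def out_size(size, kernel, stride, pad):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1


# (kind, kernel, filters, stride); hand-transcribed, an assumption of this sketch.
LAYERS = (
    [("conv", 7, 64, 2), ("pool", 2, None, 2),
     ("conv", 3, 192, 1), ("pool", 2, None, 2),
     ("conv", 1, 128, 1), ("conv", 3, 256, 1),
     ("conv", 1, 256, 1), ("conv", 3, 512, 1),
     ("pool", 2, None, 2)]
    + [("conv", 1, 256, 1), ("conv", 3, 512, 1)] * 4
    + [("conv", 1, 512, 1), ("conv", 3, 1024, 1), ("pool", 2, None, 2)]
    + [("conv", 1, 512, 1), ("conv", 3, 1024, 1)] * 2
    + [("conv", 3, 1024, 1), ("conv", 3, 1024, 2),
       ("conv", 3, 1024, 1), ("conv", 3, 1024, 1)]
)


def trace(input_size=448):
    """Return (kind, height, width, channels) after every layer."""
    size, channels, shapes = input_size, 3, []
    for kind, kernel, filters, stride in LAYERS:
        if kind == "conv":
            # Assumed "same"-style padding of kernel // 2.
            size = out_size(size, kernel, stride, kernel // 2)
            channels = filters
        else:
            # Pooling is assumed unpadded and keeps the channel count.
            size = out_size(size, kernel, stride, 0)
        shapes.append((kind, size, size, channels))
    return shapes


for shape in trace():
    print(shape)
```

Under these assumptions only the stride-2 layers change the spatial size, halving it each time from 448 down to 7, after which the two fully connected layers produce the 7 x 7 x 30 prediction tensor.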

Network training details

Activation functions

A linear activation function is used for the final layer, which predicts class probabilities and bounding boxes, while a leaky ReLU, defined as follows, is used for all other layers:

[math]\displaystyle{ \phi(x) = x \cdot \mathbf{1}_{\{x \gt 0\}} + 0.1x \cdot \mathbf{1}_{\{x \leq 0\}} }[/math]
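
The activation above can be written directly as a scalar function. This is only an illustrative sketch; the function name leaky_relu and the slope parameter are naming choices for this example, with the slope of 0.1 taken from the definition.

```python
def leaky_relu(x, slope=0.1):
    """Leaky ReLU: identity for positive inputs, slope * x otherwise."""
    return x if x > 0 else slope * x


for value in (-2.0, 0.0, 3.0):
    print(value, leaky_relu(value))
```

Unlike a standard ReLU, negative inputs are scaled rather than zeroed, so they still pass a small gradient during training.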

Layers for feature extraction and detection

The authors elect to pretrain the first 20 convolutional layers, followed by an average-pooling layer and a fully connected layer, on a large dataset, which in the paper's case was ImageNet's 1000-class competition dataset [6]. Afterwards, 4 convolutional layers and two fully connected layers with randomly initialized weights are added to perform detection, as shown to be beneficial in Ren et al. [7].

Loss function

The entire loss function which YOLO optimizes for is defined as follows:

[math]\displaystyle{ \begin{align*} L &= \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\ & \ \ \ \ \ + \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\ & \ \ \ \ \ + \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2 \\ & \ \ \ \ \ + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\ & \ \ \ \ \ + \sum_{i=0}^{S^2} \mathbf{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2, \end{align*} }[/math]

where:

  • [math]\displaystyle{ \mathbf{1}_{i}^{\text{obj}} }[/math] is an indicator for an object appearing in the [math]\displaystyle{ i }[/math]-th grid cell, that is, for some bounding box in that cell being responsible for the given prediction
  • [math]\displaystyle{ \mathbf{1}_{ij}^{\text{obj}} }[/math] is an indicator for the [math]\displaystyle{ j }[/math]-th bounding box in the [math]\displaystyle{ i }[/math]-th grid cell being responsible for the given prediction
  • [math]\displaystyle{ \mathbf{1}_{ij}^{\text{noobj}} }[/math] is an indicator for the [math]\displaystyle{ j }[/math]-th bounding box in the [math]\displaystyle{ i }[/math]-th grid cell not being responsible for the given prediction
  • [math]\displaystyle{ \lambda_{\text{coord}} }[/math] is a hyperparameter designed to increase the loss stemming from bounding box coordinate predictions, set in the paper to be [math]\displaystyle{ 5 }[/math]
  • [math]\displaystyle{ \lambda_{\text{noobj}} }[/math] is a hyperparameter designed to decrease the loss stemming from confidence predictions for boxes that do not contain objects, set in the paper to be [math]\displaystyle{ 0.5 }[/math]
  • [math]\displaystyle{ (\hat{x}, \hat{y}) }[/math] are the actual (ground-truth) center coordinates of the bounding box for the given data
  • [math]\displaystyle{ (\hat{w}, \hat{h}) }[/math] are the actual (ground-truth) width and height of the bounding box for the given data
  • [math]\displaystyle{ \hat{C} }[/math] is the IOU of the predicted bounding box with the ground-truth bounding box.

This loss function appears quite complex, and may be easier to understand in parts.

The bounding box position loss

[math]\displaystyle{ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] }[/math]

This term optimizes the sum-of-squared error between the predicted and actual bounding box centers. Sum-of-squared error was selected as the underlying loss function for the model because it is simple to optimize.

The bounding box dimension loss

[math]\displaystyle{ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] }[/math]

The sum-of-squared error is used here as well, with a slight adjustment: the square root of each dimension is taken before the difference is computed. This is done so that the error metric accounts for the scale of deviations. That is, small deviations in large boxes should be of less significance than small deviations in small boxes. The authors claim that predicting the square root of the bounding box dimensions partially addresses this issue.
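
A small numerical sketch can make the effect of the square root concrete. The widths below (0.85 against 0.90 for a large box, 0.05 against 0.10 for a small box) are made-up values chosen only so that both boxes have the same absolute deviation of 0.05.

```python
import math


def plain_error(w, w_true):
    """Squared error on the raw dimension."""
    return (w - w_true) ** 2


def sqrt_error(w, w_true):
    """Squared error on the square root of the dimension, as in the loss."""
    return (math.sqrt(w) - math.sqrt(w_true)) ** 2


large = (0.85, 0.90)  # hypothetical predicted and actual width, large box
small = (0.05, 0.10)  # hypothetical predicted and actual width, small box

print("plain:", plain_error(*large), plain_error(*small))
print("sqrt: ", sqrt_error(*large), sqrt_error(*small))
```

The plain squared error treats both deviations identically, whereas the square-root version penalizes the small box about an order of magnitude more than the large one.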

The bounding box predictor loss

[math]\displaystyle{ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 }[/math]

This is the loss associated with the confidence score for each bounding box predictor, again using sum-of-squared error.

The classification loss

[math]\displaystyle{ \sum_{i=0}^{S^2} \mathbf{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2 }[/math]

This is the regular sum-of-squared error for classification, except that the indicator [math]\displaystyle{ \mathbf{1}_{i}^{\text{obj}} }[/math] is included to ensure that classification error is not penalized for cells which do not contain objects.
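
Putting the four parts together, the whole loss can be sketched in plain Python. The data layout here (a list of cell dictionaries with the keys boxes, classes and truth) is hypothetical and chosen only for readability. The sketch also assumes that the caller has already decided which box is responsible and computed its IOU, that predicted widths and heights are non-negative, and that the target confidence for non-responsible boxes is 0.

```python
import math

LAMBDA_COORD = 5.0  # weight on coordinate losses, as set in the paper
LAMBDA_NOOBJ = 0.5  # weight on confidence loss for boxes without objects


def yolo_loss(cells):
    """Sum-of-squared-error loss over grid cells (illustrative sketch).

    Each cell is a dict with:
      "boxes":   list of B predicted (x, y, w, h, confidence) tuples
      "classes": list of predicted class probabilities
      "truth":   None, or a dict with "box" (x, y, w, h), "classes",
                 "responsible" (index j of the responsible box) and
                 "iou" (target confidence for that box)
    """
    total = 0.0
    for cell in cells:
        truth = cell["truth"]
        for j, (x, y, w, h, conf) in enumerate(cell["boxes"]):
            if truth is not None and j == truth["responsible"]:
                tx, ty, tw, th = truth["box"]
                # Position loss.
                total += LAMBDA_COORD * ((x - tx) ** 2 + (y - ty) ** 2)
                # Dimension loss on square roots.
                total += LAMBDA_COORD * (
                    (math.sqrt(w) - math.sqrt(tw)) ** 2
                    + (math.sqrt(h) - math.sqrt(th)) ** 2
                )
                # Confidence loss for the responsible box.
                total += (conf - truth["iou"]) ** 2
            else:
                # Confidence loss for non-responsible boxes (target assumed 0).
                total += LAMBDA_NOOBJ * conf ** 2
        if truth is not None:
            # Classification loss, only for cells containing an object.
            total += sum(
                (p - t) ** 2 for p, t in zip(cell["classes"], truth["classes"])
            )
    return total


empty_cell = {"boxes": [(0.5, 0.5, 0.1, 0.1, 0.2)] * 2,
              "classes": [0.5, 0.5], "truth": None}
print(yolo_loss([empty_cell]))
```

For a cell without an object, only the down-weighted confidence term contributes, which is why the class probabilities in empty_cell have no effect on the result.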



References

  • [1]P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
  • [2]R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE, 2014.
  • [3]R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
  • [4]S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015.
  • [5]P. Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37: 547–579, 1901.
  • [6]O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.
  • [7]S. Ren, K. He, R. B. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. CoRR, abs/1504.06066, 2015.