stat441F18/YOLO
The You Only Look Once (YOLO) object detection model is a one-shot object detection network that aims to combine reasonable accuracy with extreme speed. Unlike other popular approaches to object detection, such as sliding-window DPM [1] or region proposal models like the R-CNN variants [2][3][4], YOLO does not consist of a multi-stage pipeline that can be difficult to optimize. Instead, it frames object detection as a single regression problem and uses a single convolutional neural network to predict bounding boxes and class probabilities for each box.
Some of the cited benefits of the YOLO model are:
- Extreme speed. This is attributable to the framing of detection as a single regression problem.
- Learning of generalizable object representations. YOLO is experimentally found to degrade in average precision at a slower rate than comparable methods such as R-CNN when tested on data that differs substantially at the pixel level from its training data, such as the artwork in the Picasso and People-Art datasets.
- Global reasoning during prediction. Since YOLO sees whole images during training, unlike other techniques which only see a subsection of the image, its neural network implicitly encodes global contextual information about classes in addition to their local appearance. As a result, YOLO makes fewer misclassification errors on image backgrounds than other methods, which cannot benefit from this global context.
Presented by
- Mitchell Snaith | msnaith@edu.uwaterloo.ca
Encoding predictions
The most important part of the YOLO model is its novel approach to prediction encoding.
The input image is first divided into an [math]\displaystyle{ S \times S }[/math] grid of cells. For each ground-truth bounding box in the input image, YOLO takes the center point of the box and assigns it to the grid cell that contains it. That cell is then responsible for detecting the object.
For each grid cell, YOLO predicts [math]\displaystyle{ B }[/math] bounding boxes along with confidence scores. These bounding boxes are predicted directly by fully connected layers at the end of the single neural network, as seen later in the network architecture. Each bounding box consists of 5 predictions:
[math]\displaystyle{ \begin{align*} &\left. \begin{aligned} x \\ y \end{aligned} \right\rbrace \text{the center of the bounding box relative to the grid cell} \\ &\left. \begin{aligned} w \\ h \end{aligned} \right\rbrace \text{the width and height of the bounding box relative to the whole input} \\ &\left. \begin{aligned} p_c \end{aligned} \right\rbrace \text{the confidence of presence of an object of any class} \end{align*} }[/math]
[math]\displaystyle{ (x, y) }[/math] and [math]\displaystyle{ (w, h) }[/math] are normalized to lie in [math]\displaystyle{ [0, 1] }[/math]. Further, [math]\displaystyle{ p_c }[/math] in this context is defined as follows:
[math]\displaystyle{ p_c = P(\text{object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} }[/math]
Here IOU is the intersection over union, also called the Jaccard index[5]. It is an evaluation metric that rewards bounding boxes which significantly overlap with the ground-truth bounding boxes of labelled objects in the input.
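As a quick illustrative sketch (the helper below and its corner-coordinate box format are assumptions for illustration, not part of the paper), IOU between two boxes can be computed as:

```python
def iou(box_a, box_b):
    # Boxes given as (x_min, y_min, x_max, y_max) in the same coordinate system.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ≈ 0.143: small overlap, low reward
```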
Each grid cell must also predict [math]\displaystyle{ C }[/math] class probabilities [math]\displaystyle{ P(C_i | \text{object}) }[/math]. The set of class probabilities is only predicted once for each grid cell, irrespective of the number of boxes [math]\displaystyle{ B }[/math]. Thus combining the grid cell division with the bounding box and class probability predictions, we end up with a tensor output of the shape [math]\displaystyle{ S \times S \times (B \cdot 5 + C) }[/math].
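As a concrete sketch of this output encoding, assuming the configuration used in the paper for PASCAL VOC (S = 7, B = 2, C = 20) and a hypothetical ordering of values within each cell, the 7 x 7 x 30 output tensor can be sliced per cell as follows:

```python
import numpy as np

S, B, C = 7, 2, 20                      # grid size, boxes per cell, classes (PASCAL VOC)
depth = B * 5 + C                       # 2 * 5 + 20 = 30
output = np.random.rand(S, S, depth)    # stand-in for the network's S x S x 30 output

for i in range(S):
    for j in range(S):
        cell = output[i, j]
        boxes = cell[:B * 5].reshape(B, 5)   # each row is (x, y, w, h, p_c)
        class_probs = cell[B * 5:]           # P(C_i | object), shared by all B boxes
        # Class-specific confidence for each box: p_c * P(C_i | object).
        scores = boxes[:, 4:5] * class_probs # shape (B, C)
```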
Neural network architecture
The network is structured quite conventionally: convolutional and max pooling layers perform feature extraction, followed by further convolutional layers and 2 fully connected layers at the end which predict the bounding boxes and class probabilities.
layer | filters | stride | out dimension |
---|---|---|---|
Conv 1 | 7 x 7 x 64 | 2 | 224 x 224 x 64 |
Max Pool 1 | 2 x 2 | 2 | 112 x 112 x 64 |
Conv 2 | 3 x 3 x 192 | 1 | 112 x 112 x 192 |
Max Pool 2 | 2 x 2 | 2 | 56 x 56 x 192 |
Conv 3 | 1 x 1 x 128 | 1 | 56 x 56 x 128 |
Conv 4 | 3 x 3 x 256 | 1 | 56 x 56 x 256 |
Conv 5 | 1 x 1 x 256 | 1 | 56 x 56 x 256 |
Conv 6 | 3 x 3 x 512 | 1 | 56 x 56 x 512 |
Max Pool 3 | 2 x 2 | 2 | 28 x 28 x 512 |
Conv 7 | 1 x 1 x 256 | 1 | 28 x 28 x 256 |
Conv 8 | 3 x 3 x 512 | 1 | 28 x 28 x 512 |
Conv 9 | 1 x 1 x 256 | 1 | 28 x 28 x 256 |
Conv 10 | 3 x 3 x 512 | 1 | 28 x 28 x 512 |
Conv 11 | 1 x 1 x 256 | 1 | 28 x 28 x 256 |
Conv 12 | 3 x 3 x 512 | 1 | 28 x 28 x 512 |
Conv 13 | 1 x 1 x 256 | 1 | 28 x 28 x 256 |
Conv 14 | 3 x 3 x 512 | 1 | 28 x 28 x 512 |
Conv 15 | 1 x 1 x 512 | 1 | 28 x 28 x 512 |
Conv 16 | 3 x 3 x 1024 | 1 | 28 x 28 x 1024 |
Max Pool 4 | 2 x 2 | 2 | 14 x 14 x 1024 |
Conv 17 | 1 x 1 x 512 | 1 | 14 x 14 x 512 |
Conv 18 | 3 x 3 x 1024 | 1 | 14 x 14 x 1024 |
Conv 19 | 1 x 1 x 512 | 1 | 14 x 14 x 512 |
Conv 20 | 3 x 3 x 1024 | 1 | 14 x 14 x 1024 |
Conv 21 | 3 x 3 x 1024 | 1 | 14 x 14 x 1024 |
Conv 22 | 3 x 3 x 1024 | 2 | 7 x 7 x 1024 |
Conv 23 | 3 x 3 x 1024 | 1 | 7 x 7 x 1024 |
Conv 24 | 3 x 3 x 1024 | 1 | 7 x 7 x 1024 |
Fully Connected 1 | - | - | 4096 |
Fully Connected 2 | - | - | 7 x 7 x 30 |
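To make the table concrete, here is a minimal sketch of the architecture in PyTorch, built from a layer configuration list. The padding choices, module names, and use of PyTorch are assumptions made for illustration (the original implementation uses the Darknet framework); the layer shapes follow the table above, and the activations follow the next section.

```python
import torch
import torch.nn as nn

# (kernel_size, out_channels, stride) for conv layers; "M" for 2 x 2 max pooling.
CFG = [
    (7, 64, 2), "M",
    (3, 192, 1), "M",
    (1, 128, 1), (3, 256, 1), (1, 256, 1), (3, 512, 1), "M",
] + [(1, 256, 1), (3, 512, 1)] * 4 + [
    (1, 512, 1), (3, 1024, 1), "M",
] + [(1, 512, 1), (3, 1024, 1)] * 2 + [
    (3, 1024, 1), (3, 1024, 2), (3, 1024, 1), (3, 1024, 1),
]

def build_yolo(S=7, B=2, C=20):
    layers, in_ch = [], 3
    for item in CFG:
        if item == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            k, out_ch, stride = item
            layers.append(nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2))
            layers.append(nn.LeakyReLU(0.1))       # leaky ReLU, see next section
            in_ch = out_ch
    layers += [
        nn.Flatten(),
        nn.Linear(S * S * 1024, 4096),
        nn.LeakyReLU(0.1),
        nn.Linear(4096, S * S * (B * 5 + C)),      # linear activation on the final layer
    ]
    return nn.Sequential(*layers)

net = build_yolo()
out = net(torch.zeros(1, 3, 448, 448))
print(out.shape)  # torch.Size([1, 1470]) -> reshape to (7, 7, 30)
```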
Network training details
Activation functions
A linear activation function is used for the final layer, which predicts class probabilities and bounding boxes, while a leaky ReLU, defined as follows, is used for all other layers:
[math]\displaystyle{ \phi(x) = x \cdot \mathbf{1}_{\{x \gt 0\}} + 0.1x \cdot \mathbf{1}_{\{x \leq 0\}} }[/math]
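For instance, this activation corresponds to PyTorch's built-in leaky ReLU with a negative slope of 0.1:

```python
import torch

phi = torch.nn.LeakyReLU(negative_slope=0.1)   # matches the definition above
x = torch.tensor([-2.0, 0.0, 3.0])
print(phi(x))                                  # tensor([-0.2000, 0.0000, 3.0000])
```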
Layers for feature extraction and detection
The authors elect to pretrain the first 20 convolutional layers, followed by an average-pooling layer and a fully connected layer, on a large dataset, which in the paper's case is ImageNet's 1000-class competition dataset [6]. Afterwards, four convolutional layers and two fully connected layers with randomly initialized weights are added to perform detection, as this has been shown to be beneficial by Ren et al. [7].
Loss function
The full loss function that YOLO optimizes is defined as follows:
[math]\displaystyle{ \begin{align*} L &= \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\ & \ \ \ \ \ + \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\ & \ \ \ \ \ + \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2 \\ & \ \ \ \ \ + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbf{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\ & \ \ \ \ \ + \sum_{i=0}^{S^2} \mathbf{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2, \end{align*} }[/math]
where:
- [math]\displaystyle{ \mathbf{1}_{ij}^{\text{obj}} }[/math] is an indicator for the [math]\displaystyle{ j }[/math]-th bounding box in the [math]\displaystyle{ i }[/math]-th grid cell being responsible for the given prediction
- [math]\displaystyle{ \lambda_{\text{coord}} }[/math] is a hyperparameter designed to increase the loss stemming from bounding box coordinate predictions, set in the paper to be [math]\displaystyle{ 5 }[/math]
- [math]\displaystyle{ \lambda_{\text{noobj}} }[/math] is a hyperparameter designed to decrease the loss stemming from confidence predictions for boxes that do not contain objects, set in the paper to be [math]\displaystyle{ 0.5 }[/math]
This loss function is evidently complex, so it is easier to understand term by term (a code sketch follows the list):
- The first term penalizes errors in the predicted box center [math]\displaystyle{ (x, y) }[/math], only for the predictor responsible for an object, and is scaled up by [math]\displaystyle{ \lambda_{\text{coord}} }[/math].
- The second term penalizes errors in the predicted width and height; square roots are used so that a given error in a large box is penalized less than the same error in a small box.
- The third term penalizes confidence errors for predictors responsible for an object.
- The fourth term penalizes confidence errors for predictors that are not responsible for any object, scaled down by [math]\displaystyle{ \lambda_{\text{noobj}} }[/math] so that the many empty cells do not overwhelm the gradient from cells that do contain objects.
- The fifth term penalizes classification errors, only for grid cells that contain an object.
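As a simplified sketch in PyTorch (the tensor layout, argument names, and the assumption that the responsibility mask [math]\displaystyle{ \mathbf{1}_{ij}^{\text{obj}} }[/math] has already been computed are all illustrative choices, not the paper's implementation), the loss can be written as:

```python
import torch

def yolo_loss(pred_boxes, pred_classes, true_boxes, true_classes,
              obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Sketch of the YOLO loss under an assumed tensor layout.

    pred_boxes, true_boxes     : (N, S, S, B, 5) with (x, y, w, h, C) per box
    pred_classes, true_classes : (N, S, S, num_classes) per-cell class scores
    obj_mask                   : (N, S, S, B) float indicator 1_ij^obj that box j
                                 of cell i is responsible for a ground-truth object
    """
    noobj_mask = 1.0 - obj_mask
    cell_obj = obj_mask.max(dim=-1).values            # 1_i^obj: cell contains an object

    xy_err = ((pred_boxes[..., 0:2] - true_boxes[..., 0:2]) ** 2).sum(-1)
    wh_err = ((pred_boxes[..., 2:4].clamp(min=0).sqrt()   # clamp avoids NaN gradients
               - true_boxes[..., 2:4].sqrt()) ** 2).sum(-1)
    conf_err = (pred_boxes[..., 4] - true_boxes[..., 4]) ** 2
    class_err = ((pred_classes - true_classes) ** 2).sum(-1)

    return (lambda_coord * (obj_mask * (xy_err + wh_err)).sum()
            + (obj_mask * conf_err).sum()
            + lambda_noobj * (noobj_mask * conf_err).sum()
            + (cell_obj * class_err).sum())
```

In the paper, a predictor [math]\displaystyle{ j }[/math] is made responsible for a ground-truth box when it has the highest IOU with that box among the [math]\displaystyle{ B }[/math] predictors of the cell containing the box's center.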
References
- [1] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
- [2] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE, 2014.
- [3] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
- [4] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015.
- [5] P. Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901.
- [6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.
- [7] S. Ren, K. He, R. B. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. CoRR, abs/1504.06066, 2015.