Annotating Object Instances with a Polygon RNN

From statwiki
Revision as of 22:08, 1 November 2018

Summary of a CVPR '17 paper that received a Best Paper Honorable Mention


If a person is shown a snapshot of a scene, how will they describe it? They might say that there is a car parked near the curb, or that the car is parked right beside a street light. This ability to decompose a scene into separate entities is key to understanding what is around us, and it helps us reason about the behavior of the objects in the scene.

Automating this process is a classic computer vision problem, often termed "object detection." The term is used loosely to cover several related tasks; however, there are four distinct levels of detection (refer to Figure 1 for a visual cue):

1. Classification + Localization: This is the most basic method. It detects whether an object is present or absent in the image and then identifies the position of the object within the image in the form of a bounding box overlaid on the image.

2. Object Detection: The classic definition of object detection refers to the detection and localization of multiple objects of interest in the image. The output is still a set of bounding boxes overlaid on the image at the positions corresponding to the locations of the objects.

3. Semantic Segmentation: This is a pixel-level approach, i.e., each pixel in the image is assigned a category label. Here, there is no distinction between instances: the output may report that objects from three distinct categories are present in the image, but it does not track or report how many instances of each category appear.

4. Instance Segmentation: The goal here is not only to assign pixel-level categorical labels, but also to identify each entity separately, e.g., as sheep 1, sheep 2, sheep 3, grass, and so on.

Figure 1: Different levels of detection in an image.
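The difference between the last two levels can be made concrete with a toy label map. The sketch below is purely illustrative: the 4x6 grid, the label values (0 for grass, 1 for sheep), and the helper `bounding_box` are assumptions introduced here, not taken from the paper.

```python
# Toy 4x6 "image". Semantic map: 0 = grass, 1 = sheep (illustrative labels).
semantic = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]

# Instance map over the same pixels: each sheep gets its own id (1 and 2).
instance = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]


def bounding_box(label_map, label):
    """Return (min_row, min_col, max_row, max_col) of all pixels equal to label."""
    coords = [
        (r, c)
        for r, row in enumerate(label_map)
        for c, value in enumerate(row)
        if value == label
    ]
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))


# Semantic segmentation only knows which categories are present.
semantic_labels = sorted({v for row in semantic for v in row})

# Instance segmentation additionally separates the individual objects.
instance_ids = sorted({v for row in instance for v in row if v > 0})
boxes = {i: bounding_box(instance, i) for i in instance_ids}

print("categories:", semantic_labels)
print("instances:", instance_ids)
print("per-instance boxes:", boxes)
print("box from semantic map:", bounding_box(semantic, 1))
```

With the semantic map alone, the only box that can be recovered for the "sheep" label spans both animals, whereas the instance map yields one box per sheep.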


Semantic segmentation helps us achieve a deeper understanding of images than image classification or object detection. Beyond this, instance segmentation is crucial in applications where multiple objects of the same category must be tracked, especially in autonomous driving, mobile robotics, and medical image processing. This paper presents a novel method for tackling the instance segmentation problem, aimed specifically at autonomous driving but shown to generalize well to other fields such as medical image processing.

Most recent approaches to instance segmentation are based on deep neural networks and have demonstrated impressive performance. Because these approaches require substantial computational resources and their performance depends on the amount of available training data, there has been a growing demand to label/annotate large-scale datasets. This is both expensive and time consuming. Thus, the main goal of the paper is to enable semi-automatic annotation of object instances.

Most available datasets pass through a stage in which annotators manually outline the objects with a closed polygon. Polygons allow objects to be annotated with a small number of clicks (30 to 40) compared to other methods; this approach works because the silhouette of an object is typically connected and without holes. Thus, the authors' intuition behind the success of this method is the sparse nature of these polygons, which allows an object to be represented as an outlined cluster of pixels rather than through a pixel-level description.
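The sparsity argument can be illustrated by comparing the size of a polygon with the size of the pixel mask it encloses. The following is a minimal sketch, assuming a hypothetical rectangular object on a 16x20 grid and a standard even-odd ray-casting test evaluated at pixel centers; none of these names or values come from the paper.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting: count edge crossings of a ray going in +x."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(polygon, height, width):
    """Mark every pixel whose center lies inside the polygon."""
    return [
        [1 if point_in_polygon(c + 0.5, r + 0.5, polygon) else 0 for c in range(width)]
        for r in range(height)
    ]


# Hypothetical object outline given as (x, y) vertices.
polygon = [(2, 2), (18, 2), (18, 12), (2, 12)]
mask = rasterize(polygon, height=16, width=20)
pixel_count = sum(sum(row) for row in mask)

print("vertices:", len(polygon))
print("pixels covered:", pixel_count)
```

Here four vertices stand in for 160 labeled pixels. Real object outlines need more vertices, but the gap between the number of clicks and the number of pixels remains large.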


As input to the model, an annotator, or perhaps another neural network, provides a ground-truth bounding box containing an object of interest. The model then auto-generates a polygon outlining the object instance using a recurrent neural network (RNN), which the authors call Polygon-RNN.

The RNN model predicts the vertices of the polygon one per time step, given a CNN representation of the image, the vertices predicted at the previous two time steps, and the location of the first vertex. The first vertex is handled differently, as will be described shortly. The information from the previous two time steps helps the RNN trace the polygon in a specific direction, and the first vertex provides a cue for closing the loop of polygon edges.
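The inputs available at each time step can be sketched as a simple decoding loop. This is only an interface sketch under stated assumptions: `predict_next` is a hypothetical stand-in for the trained network (here a stub that replays a known polygon), the `END` signal and the choice to pad the initial history with the first vertex are assumptions made for illustration, and `features` is a placeholder for the CNN representation.

```python
END = None  # hypothetical end-of-sequence signal emitted when the loop closes


def make_stub_predictor(target):
    """Stand-in for the trained RNN: replays a known polygon vertex by vertex."""
    def predict_next(features, prev2, prev1, first):
        idx = target.index(prev1)
        nxt = target[(idx + 1) % len(target)]
        # Reaching the first vertex again means the polygon is closed.
        return END if nxt == first else nxt
    return predict_next


def decode_polygon(features, first, predict_next, max_steps=50):
    """Generate vertices one per step from the two previous vertices and the first."""
    vertices = [first]
    prev2, prev1 = first, first  # assumption: pad the history with the first vertex
    for _ in range(max_steps):
        nxt = predict_next(features, prev2, prev1, first)
        if nxt is END:
            break
        vertices.append(nxt)
        prev2, prev1 = prev1, nxt
    return vertices


target = [(2, 2), (18, 2), (18, 12), (2, 12)]  # hypothetical ground-truth outline
predictor = make_stub_predictor(target)
decoded = decode_polygon(features=None, first=(18, 2), predict_next=predictor)
print(decoded)
```

The stub ignores `features` and `prev2`; a real model would use the two previous vertices to infer the direction of travel and the first vertex to decide when to stop.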

The polygon is parametrized as a sequence of 2D vertices, and it is assumed that the polygon is closed. In addition, polygon generation is fixed to follow a clockwise orientation, since a polygon is a cyclic structure and can therefore be written as a sequence in multiple ways. The starting point of the sequence, however, is left free: it can be any of the vertices of the polygon.
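The two conventions (fixed clockwise orientation, free starting vertex) can be sketched with the shoelace formula and cyclic rotations. The helper names below are hypothetical, and the sign convention assumes a y-up coordinate frame, in which a negative signed area means clockwise; with image coordinates, where y grows downward, the sign flips.

```python
def signed_area(polygon):
    """Shoelace formula. In a y-up frame: > 0 counter-clockwise, < 0 clockwise."""
    n = len(polygon)
    total = 0.0
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return total / 2.0


def make_clockwise(polygon):
    """Return the same polygon with its vertices ordered clockwise (y-up frame)."""
    return list(polygon) if signed_area(polygon) < 0 else list(polygon)[::-1]


def rotations(polygon):
    """All vertex sequences that describe the same closed polygon, one per start."""
    return [polygon[i:] + polygon[:i] for i in range(len(polygon))]


square_ccw = [(0, 0), (1, 0), (1, 1), (0, 1)]  # counter-clockwise unit square
square_cw = make_clockwise(square_ccw)
sequences = rotations(square_cw)

print("clockwise order:", square_cw)
for seq in sequences:
    print("start at", seq[0], "->", seq)
```

Fixing the orientation removes one source of ambiguity, while every rotation remains an equally valid sequence for the same polygon: each has the same signed area and the same set of vertices.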


File:Figure 2.jpeg