One-Shot Object Detection with Co-Attention and Co-Excitation


Revision as of 12:00, 21 November 2020

Presented By

Gautam Bathla

Background

Object Detection is a technique where the model gets an image as an input and outputs the class and location of all the objects present in the image.

Figure 1: Object Detection on an image

Figure 1 shows an example where the model identifies and locates all the instances of different objects present in the image successfully. It encloses each object within a bounding box and annotates each box with the class of the object present inside the box.

State-of-the-art object detectors are trained on thousands of images covering different classes before they can accurately predict the class and spatial location of objects in unseen images from those classes. When a model is instead trained with only K labeled instances for each of N classes, the setting is known as N-way K-shot classification: K = 0 is zero-shot learning, K = 1 is one-shot learning, and K > 1 is few-shot learning.
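To make the N-way K-shot setting concrete, the following minimal sketch draws a support set of K labeled instances for each of N classes. The function name `sample_episode`, the class names, and the file names are all hypothetical and chosen only for illustration; they do not come from the paper.

```python
import random

def sample_episode(labels_to_items, n_way, k_shot, seed=None):
    """Draw an N-way K-shot support set from a {class: [items]} mapping."""
    rng = random.Random(seed)
    # Pick N distinct classes, then K labeled instances for each of them.
    classes = rng.sample(sorted(labels_to_items), n_way)
    return {c: rng.sample(labels_to_items[c], k_shot) for c in classes}

# Toy dataset: hypothetical class names and image file names.
dataset = {
    "cat": ["cat_0.jpg", "cat_1.jpg", "cat_2.jpg"],
    "dog": ["dog_0.jpg", "dog_1.jpg", "dog_2.jpg"],
    "bus": ["bus_0.jpg", "bus_1.jpg", "bus_2.jpg"],
}

one_shot = sample_episode(dataset, n_way=2, k_shot=1, seed=0)  # K = 1
few_shot = sample_episode(dataset, n_way=2, k_shot=3, seed=0)  # K > 1
```

With K = 1 each selected class contributes a single labeled example, which is the regime this paper targets.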

Introduction

The problem this paper tackles is the following: given a query image p, the model must find all instances in a target image of the object shown in the query image. Consider how a human would approach the same task of identifying and locating instances of a never-before-seen object in the target image based on the query image. The person would compare characteristics of the object such as shape, texture, and color, and apply attention for localization. The human visual system can do this even under varying conditions such as lighting and viewing angle, and the authors try to build the same capability into the model. The target and query images do not need to match exactly; they may vary as long as they share enough attributes to belong to the same category.

In this paper, the authors make three technical contributions. The first is the use of non-local operations to generate better region proposals for the target image based on the query image; this operation can be thought of as a co-attention mechanism. The second is a Squeeze and Co-Excitation mechanism that identifies and emphasizes relevant features in order to select the relevant proposals, and hence the instances, in the target image. The third is a margin-based ranking loss for predicting the similarity of region proposals to the given query image, irrespective of whether the class label was seen or unseen during training.
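The idea of giving more importance to relevant features can be illustrated with a simplified channel-gating sketch. It assumes a design in the spirit of squeeze-and-excitation: the query feature map is averaged per channel ("squeeze"), the result is passed through a sigmoid to obtain per-channel gates, and the same gates rescale both the query and the target features ("co-excitation"). The learned layers a real model would place between the squeeze and the gate are omitted, and the function names and toy feature maps are hypothetical, so this should not be read as the authors' exact module.

```python
import math

def squeeze(feature_map):
    """Global average pooling: one scalar per channel."""
    return [sum(channel) / len(channel) for channel in feature_map]

def rescale(feature_map, gates):
    """Multiply every value in a channel by that channel's gate."""
    return [[g * v for v in channel] for g, channel in zip(gates, feature_map)]

def co_excite(query_map, target_map):
    """Gate both feature maps with weights derived from the query only.

    Assumption: gates are a plain sigmoid of the squeezed query; a trained
    model would learn a small network here instead.
    """
    gates = [1.0 / (1.0 + math.exp(-z)) for z in squeeze(query_map)]
    return rescale(query_map, gates), rescale(target_map, gates), gates

# Feature maps are lists of channels; each channel is a flat list of values.
query = [[4.0, 4.0], [-4.0, -4.0]]             # channel 0 is strongly active
target = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]    # same number of channels

new_query, new_target, gates = co_excite(query, target)
```

In this toy case the channel that is active in the query keeps most of its magnitude in the target, while the inactive channel is suppressed.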

Previous Work

All state-of-the-art object detectors are variants of deep convolutional neural networks. There are two types of object detectors:

1) Two-Stage Object Detectors: These detectors generate region proposals in the first stage, then classify and refine the proposals in the second stage, e.g. FasterRCNN[1].

2) One-Stage Object Detectors: These detectors predict bounding boxes and their corresponding labels directly, without a separate proposal stage. Many do so from a fixed set of anchors; CornerNet[2] instead detects each box as a pair of corner keypoints.


Several approaches have been proposed to tackle few-shot object detection. They are based on transfer learning[3], meta-learning[4], and metric-learning[5].

1) Transfer Learning: Chen et al.[3] proposed a regularization technique to reduce overfitting when the model is trained on just a few instances of each unseen class.

2) Meta-Learning: Kang et al.[4] trained a meta-model to re-weight the features of an image extracted by the base model.

3) Metric-Learning: These frameworks replace the conventional classifier layer with a metric-based classifier layer.
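As an illustration of what a metric-based classifier layer can look like, the sketch below assigns an embedding to the class whose prototype vector is most similar under cosine similarity. The choice of cosine similarity, the prototype vectors, and the function names are assumptions made for this example; the cited metric-learning frameworks may use other metrics.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def metric_classify(embedding, prototypes):
    """Return the label whose prototype is most similar to the embedding."""
    return max(
        prototypes,
        key=lambda label: cosine_similarity(embedding, prototypes[label]),
    )

# Hypothetical class prototypes, e.g. embeddings of one labeled example each.
prototypes = {"cat": [1.0, 0.1, 0.0], "dog": [0.0, 1.0, 0.2]}
predicted = metric_classify([0.9, 0.2, 0.0], prototypes)  # "cat"
```

Because classification reduces to comparing embeddings, a new class can be added by supplying a prototype rather than retraining a fixed output layer.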

Approach

Let's define some notation before diving into the approach of this paper. Let C be the set of classes for this object detection task. Since the one-shot object detection task requires classes at inference time that were unseen during training, we divide the set of classes into two categories as follows:

[math] C = C_0 \cup C_1,[/math]

where C_0 represents the classes that the model is trained on and C_1 represents the classes on which the inference is done.
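The split can be sketched in a few lines. The example assumes that C_0 and C_1 are disjoint, which the text implies by calling the inference classes unseen but does not state explicitly; the class names and the helper `split_classes` are hypothetical.

```python
def split_classes(all_classes, unseen):
    """Partition C into seen classes C_0 and unseen classes C_1."""
    c = set(all_classes)
    c1 = set(unseen)
    if not c1 <= c:
        raise ValueError("unseen classes must be drawn from the full class set")
    c0 = c - c1  # everything not held out is available for training
    return c0, c1

# Hypothetical label set; "bus" and "sheep" are held out for inference.
C = ["cat", "dog", "car", "bus", "sheep"]
C_0, C_1 = split_classes(C, unseen=["bus", "sheep"])
```

Training then uses only annotations from C_0, while query images at inference time come from C_1.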

Let's restate the problem. Given a query image belonging to a class in C_1, the task is to predict all instances of that object in the target image. The model architecture is based on FasterRCNN[1], with ResNet-50 as the backbone for extracting image features. To tackle this problem, the authors propose three techniques: non-local object proposals, a Squeeze and Co-Excitation mechanism, and a margin-based ranking loss.
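To give a feel for what a margin-based loss on similarity scores does, here is a generic two-margin hinge sketch. It assumes each region proposal has a similarity score in [0, 1] to the query and a binary label (1 for foreground, 0 for background). The margin values, the function name, and the form of the loss are illustrative assumptions and are not claimed to match the authors' exact formulation.

```python
def margin_ranking_loss(scores, labels, m_pos=0.7, m_neg=0.3):
    """Two-margin hinge loss on proposal-to-query similarity scores.

    Assumed behavior (not taken from the paper): foreground proposals are
    penalized when their score falls below m_pos, background proposals
    when their score rises above m_neg.
    """
    total = 0.0
    for score, label in zip(scores, labels):
        if label == 1:
            total += max(m_pos - score, 0.0)
        else:
            total += max(score - m_neg, 0.0)
    return total

# Hypothetical scores for four proposals: two foreground, two background.
scores = [0.9, 0.5, 0.2, 0.6]
labels = [1, 1, 0, 0]
loss = margin_ranking_loss(scores, labels)
# About 0.5: 0.2 from the weak foreground score, 0.3 from the high background score.
```

Because the loss depends only on similarity to the query and not on a class index, the same objective applies whether the query class was seen during training or not.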