Learning to Teach

Introduction

Object tracking has been an active research topic in recent years. It involves localizing an object across consecutive video frames given an initial annotation in the first frame. The process normally consists of the following steps (a minimal sketch is given after the list).

  1. Taking an initial set of object detections.
  2. Creating and assigning a unique ID for each of the initial detections.
  3. Tracking those objects as they move around in the video frames, maintaining the assignment of unique IDs.
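To make steps 1-3 concrete, here is a minimal Python sketch of a passive multi-object tracking loop that assigns a unique ID to each initial detection and maintains those IDs across frames by greedy IoU matching. The box format (x1, y1, x2, y2), the 0.3 IoU threshold, and the helper names are illustrative assumptions, not part of the method discussed on this page.

  def iou(box_a, box_b):
      """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
      x1 = max(box_a[0], box_b[0])
      y1 = max(box_a[1], box_b[1])
      x2 = min(box_a[2], box_b[2])
      y2 = min(box_a[3], box_b[3])
      inter = max(0, x2 - x1) * max(0, y2 - y1)
      area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
      area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
      return inter / float(area_a + area_b - inter + 1e-9)

  def track(frames_detections, iou_threshold=0.3):
      """frames_detections: list of per-frame lists of detection boxes."""
      tracks = {}    # current ID -> last known box
      next_id = 0
      history = []
      for detections in frames_detections:
          assigned = {}
          for box in detections:
              # Match each detection to the best-overlapping existing track.
              best_id, best_iou = None, iou_threshold
              for tid, prev_box in tracks.items():
                  overlap = iou(box, prev_box)
                  if overlap > best_iou and tid not in assigned:
                      best_id, best_iou = tid, overlap
              if best_id is None:        # step 2: create a new unique ID
                  best_id = next_id
                  next_id += 1
              assigned[best_id] = box    # step 3: maintain the ID assignment
          tracks = assigned
          history.append(dict(assigned))
      return history

  # Example: the same object keeps ID 0 across two frames.
  history = track([[(0, 0, 10, 10)], [(1, 1, 11, 11)]])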

There are two types of object tracking.

  1. Passive tracking
  2. Active tracking

Passive tracking assumes that the object of interest is always within the image scene, so no camera control is needed during tracking. Although passive tracking is useful and well studied, it is not applicable in situations such as tracking performed by a camera-mounted mobile robot or by a drone. Active tracking, on the other hand, involves two subtasks: 1) object tracking and 2) camera control. It is difficult to jointly tune the pipeline between these two separate subtasks: object tracking may require human effort for bounding-box labeling, and camera control is non-trivial, which can lead to much expensive trial and error in the real world. A sketch of the camera-control side of such a two-stage pipeline is given below.
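The sketch below illustrates, under illustrative assumptions, the camera-control subtask in a conventional two-stage pipeline: a hand-designed proportional controller that maps the tracked bounding box back to pan/tilt commands. The gain, the box format, and the function names are hypothetical; the point is that the tracker and the controller communicate only through the bounding box, so an error observed in the final behaviour is hard to attribute to either module.

  def camera_control(box, frame_width, frame_height, gain=0.01):
      """Map a tracked (x1, y1, x2, y2) box to pan/tilt commands (proportional control)."""
      cx = (box[0] + box[2]) / 2.0
      cy = (box[1] + box[3]) / 2.0
      pan = gain * (cx - frame_width / 2.0)    # steer left/right toward the target
      tilt = gain * (cy - frame_height / 2.0)  # steer up/down toward the target
      return pan, tilt

  # Example: a box drifting to the right of a 640x480 frame yields a
  # positive pan command that turns the camera toward the target.
  pan, tilt = camera_control((400, 200, 480, 300), 640, 480)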

Intuition

As in state-of-the-art models, if the action module and the object tracking module are completely separate, it is extremely difficult to train either one, because it is impossible to know which module is responsible for the error observed at the end of an episode. At a high level, both modules serve the same purpose, since both aim at efficient navigation, so it makes sense to have a joint module consisting of both the observation and the action-taking submodules. The entire system can then be trained together, with the error propagated through the whole system. This is in line with common practice in deep reinforcement learning, where the CNNs used to extract features from Atari game frames are combined with the Q-network (as in DQN). These CNNs are trained concurrently with the feedforward Q-network, and the error function is the difference between the observed Q value and the target Q value. A minimal sketch of such a joint network and its update is given below.
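The following is a minimal PyTorch sketch of this kind of joint training: the convolutional feature extractor and the Q head form one network, so the TD error between the observed and target Q values back-propagates through both submodules. The architecture sizes, the 84x84 grayscale input, and the hyperparameters are illustrative assumptions in the spirit of DQN, not the exact model from the paper.

  import torch
  import torch.nn as nn

  class JointQNetwork(nn.Module):
      def __init__(self, num_actions):
          super().__init__()
          # Observation submodule: CNN feature extractor.
          self.features = nn.Sequential(
              nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
              nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
              nn.Flatten(),
          )
          # Action submodule: Q-value head over the discrete actions.
          self.q_head = nn.Sequential(
              nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
              nn.Linear(256, num_actions),
          )

      def forward(self, obs):
          return self.q_head(self.features(obs))

  # One TD update: the loss is the squared difference between the observed
  # (predicted) Q value and the bootstrapped target Q value, and its gradient
  # flows through both the Q head and the CNN.
  net, target_net = JointQNetwork(4), JointQNetwork(4)
  target_net.load_state_dict(net.state_dict())
  optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

  obs = torch.randn(8, 1, 84, 84)        # batch of current frames
  next_obs = torch.randn(8, 1, 84, 84)   # batch of next frames
  actions = torch.randint(0, 4, (8,))
  rewards = torch.randn(8)
  gamma = 0.99

  q_pred = net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
  with torch.no_grad():
      q_target = rewards + gamma * target_net(next_obs).max(dim=1).values
  loss = nn.functional.mse_loss(q_pred, q_target)
  optimizer.zero_grad()
  loss.backward()    # the error propagates to the whole system
  optimizer.step()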