End to end Active Object Tracking via Reinforcement Learning

=Introduction=
Object tracking has been a hot topic in recent years. It involves localization of an object in continuous video frames given an initial annotation in the first frame. The process normally consists of the following steps.
<ol>
<li> Taking an initial set of object detections. </li>
<li> Creating and assigning a unique ID for each of the initial detections. </li>
<li> Tracking those objects as they move around in the video frames, maintaining the assignment of unique IDs. </li>
</ol>
There are two types of object tracking.
<ol>
<li> Passive tracking </li>
<li> Active tracking </li>
</ol>


[[File:active_tracking_pipeline.PNG|500px|center]]


Much of the existing work addresses passive tracking, which assumes that the object of interest is always in the image scene, so that there is no need for camera control during tracking. Although passive tracking is very useful and well researched, it is not applicable in settings such as tracking performed by a camera-mounted mobile robot or by a drone.
On the other hand, active tracking involves two subtasks: 1) object tracking and 2) camera control. It is difficult to jointly tune a pipeline built from two separate subtasks. Object tracking may require substantial human effort for bounding-box labeling, and camera control is non-trivial, which can lead to many expensive trial-and-errors in the real world.

To address these challenges, this paper presents an end-to-end active tracking solution via deep reinforcement learning. More specifically, a ConvNet-LSTM network takes raw video frames as input and outputs camera movement actions.
A virtual environment is used to simulate active tracking. In the virtual environment, an agent (i.e., the tracker) observes a state (a visual frame) from a first-person perspective and takes an action; the environment then returns the updated state (the next visual frame). A3C, a modern reinforcement learning algorithm, is adopted to train the agent, with a customized reward function designed to encourage the agent to closely follow the object.
An environment augmentation technique is used to boost the tracker's generalization ability. The tracker trained in the virtual environment is then tested on a real-world video dataset to assess the generalizability of the model. A video of the first version of this paper is available here[https://www.youtube.com/watch?v=C1Bn8WGtv0w].
 
=Intuition=
 
As with state-of-the-art models, if the action module and the object tracking module are entirely separate, it is extremely difficult to train either one, because it is impossible to know which module caused the error observed at the end of an episode. At a high level, both modules serve the same purpose: efficient navigation. It therefore makes sense to build a joint module containing both the observation and the action-taking submodules, so that the entire system can be trained together and the error can be propagated through the whole system. This is in line with common practice in deep reinforcement learning, where the CNNs used to extract features (for example, in Atari games) are combined with the Q-network in DQN. The training of these CNNs happens concurrently with the Q feedforward networks, where the error function is the difference between the observed Q values and the target Q values.


=Related Work=
In the domain of object tracking, there are both active and passive approaches. The following summarizes advanced passive object tracking approaches.

1) Subspace learning was adopted to update the appearance model of an object.
:Formerly, object tracking algorithms employed a fixed appearance model. Consequently, they often performed poorly when the target object changed in appearance or illumination. To overcome this problem, Ross et al. 2008 introduce a novel tracking method that incrementally adapts the appearance model according to new observations made during tracking [2].

2) Multiple instance learning was employed to track an object.
:Many researchers have shown that a tracking algorithm can achieve better performance by employing adaptive appearance models capable of separating an object from its background. However, the discriminative classifier in those models is often difficult to update. Babenko et al. 2009 therefore introduce a novel algorithm that updates its appearance model using a "bag" of positive and negative examples, and show that tracking algorithms using weaker classifiers can still obtain superior performance [3].

3) Correlation filter based object tracking has achieved success in real-time object tracking.
:Correlation filter based object tracking algorithms attempt to "model the appearance of an object using filters". At each frame, a small tracking window representing the target object is produced, and the tracker correlates the window over the image sequence, thus achieving object tracking. Bolme et al. 2010 validate this concept by creating a novel object tracking algorithm using an adaptive correlation filter called the Minimum Output Sum of Squared Error (MOSSE) filter [4].

4) Structured output prediction was used to constrain object tracking, avoiding the conversion of positions to labels of training samples.
:Hare et al. 2016 argue that the "sliding-window" approach used by popular object tracking algorithms is flawed because "the objective of the classifier (predicting labels for sliding-windows) is decoupled from the objective of the tracker (estimating object position)." Instead, they introduce a novel algorithm that uses "a kernelized structured output support vector machine (SVM) to avoid the need for intermediate classification". They show the approach outperforms traditional trackers on various benchmarks [5].

5) Tracking, learning, and detection were integrated into one framework for long-term tracking, where a detection module was used to re-initialize the tracker once a missing object reappears.
:Long-term tracking is the task of recognizing and tracking an object as it "moves in and out of a camera's field of view". This task is made difficult by problems such as an object reappearing in the scene with changed appearance, scale, or illumination. Kalal et al. 2012 proposed a unified tracking framework (TLD) that accomplishes long-term tracking by "decomposing the task into tracking, learning, and detection". Specifically, "the tracker follows an object from frame-to-frame; the detector localizes the object's appearances; and, the learner improves the detector by learning from errors." Altogether, the TLD framework outperforms previous state-of-the-art tracking approaches [6].

6) Deep learning models like stacked autoencoders have been used to learn good representations for object tracking.
:In recent years, deep learning approaches have gained prominence in the field of object tracking. For example, Wang et al. 2013 obtain outstanding results using a deep-learning based algorithm that combines offline feature extraction and online tracking using stacked denoising autoencoders [7], whereas Wang et al. 2016 introduced a sequentially trained convolutional network that can efficiently transfer offline learned features to online visual tracking applications [8].

7) Pixel-level image classification.
:Object identification is essentially pixel-level classification, where each pixel in the image is given a label; it is a more general form of image classification. In recent years, CNNs have advanced many benchmarks in this field, and some AutoML methods, such as Neural Architecture Search, have been applied in this field and achieved state-of-the-art results.

For the active approaches, camera control and object tracking were considered as separate components. This configuration makes the problem more complex and therefore difficult to handle for current algorithms, as the number of degrees of freedom usually increases exponentially depending on the specific context or experiment, and it also requires precise temporal coordination between the object detection and tracking tasks [9]. Compared with passive approaches, fewer methods have been developed for the active setting, even though several real-world applications, such as control of an autonomous vehicle, require object tracking and control to run simultaneously. This paper tackles object tracking and camera control simultaneously in an end-to-end manner and is easy to tune.

Reinforcement Learning (RL) (Sutton & Barto, 1998) provides a principled approach to temporal decision-making problems. In a typical RL framework, an agent learns from the environment a policy function that maps state to action at each discrete time step, with the objective of maximizing the accumulated rewards returned by the environment. Historically, RL has been successfully applied to inventory management, path planning, game playing, etc.

In the domain of deep reinforcement learning, recent algorithms have achieved advanced gameplay in games such as Go and Atari games. They have also been used in computer vision tasks such as object localization, region proposal, and visual tracking. These advancements pertain to passive tracking, whereas this paper focuses on active tracking using deep RL, which had not been tried before.


=Approach=
Virtual tracking scenes are generated for both training and testing. An Asynchronous Advantage Actor-Critic (A3C) model is used to train the tracker. For efficient training, data augmentation techniques and a customized reward function are used. An RGB screen frame from the first-person perspective is chosen as the state for the study. The tracker observes a visual state and takes one action <math>a_t</math> from the following set of 6 actions.

\[A = \{\text{turn-left}, \text{turn-right}, \text{turn-left-and-move-forward},\\ \text{turn-right-and-move-forward}, \text{move-forward}, \text{no-op}\}\]

The action is processed by the environment, which returns to the agent the current reward as well as the updated screen frame <math>(r_t, s_{t+1})</math>.
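To make the interaction described above concrete, here is a minimal sketch of the observe-act-update loop using an OpenAI-Gym-style interface. The environment name, the policy callable, and the reset/step usage are assumptions for illustration, not the authors' code.

<pre>
import random

# The 6-action space described above.
ACTIONS = [
    "turn-left", "turn-right",
    "turn-left-and-move-forward", "turn-right-and-move-forward",
    "move-forward", "no-op",
]

def run_episode(env, policy, max_steps=3000):
    """Roll out one tracking episode: observe s_t, act a_t, receive (r_t, s_{t+1})."""
    state = env.reset()                      # first RGB frame (84x84x3 after resizing)
    total_reward, trace = 0.0, []
    for t in range(max_steps):
        action_id = policy(state)            # index into ACTIONS, e.g. sampled from pi(.|s_t)
        next_state, reward, done, _ = env.step(action_id)
        trace.append((state, action_id, reward))
        total_reward += reward
        state = next_state
        if done:                             # e.g. accumulated reward fell below the threshold
            break
    return trace, total_reward

# Usage (hypothetical gym-compatible ViZDoom/UnrealCV wrapper environment):
# env = gym.make("ActiveTracking-v0")
# trace, ar = run_episode(env, policy=lambda s: random.randrange(len(ACTIONS)))
</pre>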
==Tracking Scenarios==
It is not feasible to train the desired end-to-end active tracker in real-world scenarios. Therefore, the following two virtual environment engines are used for the simulated training.
===ViZDoom===
ViZDoom[http://vizdoom.cs.put.edu.pl/] (Kempka et al., 2016; ViZ) is an RL research platform based on Doom, a 3D first-person shooter video game. In ViZDoom, the game engine corresponds to the environment, while the video game player corresponds to the agent. The agent receives a state and a reward from the environment at each time step. In this study, customized ViZDoom maps are used (see Fig. 4), composed of an object (a monster) and a background (ceiling, floor, and wall). The monster walks along a pre-specified path programmed by an ACS script (Kempka et al., 2016), and the goal is to train the agent, i.e., the tracker, to closely follow the object.
  [[File:fig4.PNG|500px|center]]
===Unreal Engine===
Though convenient for research, ViZDoom does not provide realistic scenarios. To this end, Unreal Engine (UE) is adopted to construct nearly real-world environments. UE is a popular game engine and has a broad influence in the game industry. It provides realistic scenarios which can mimic real-world scenes. UnrealCV (Qiu et al., 2017) is employed in this study; it provides convenient APIs, along with a wrapper (Zhong et al., 2017) compatible with OpenAI Gym (Brockman et al., 2016), for interactions between RL algorithms and the environments constructed based on UE.
==A3C Algorithm==
This paper employs the Asynchronous Advantage Actor-Critic (A3C) algorithm for training the tracker.
At time step t, <math>s_{t}</math> denotes the observed state, corresponding to the raw RGB frame. The action set is denoted by A, of size K = |A|. An action <math>a_{t} \in A</math> is drawn from a policy function distribution \[a_{t}\sim \pi\left ( \cdot | s_{t} \right ) \in \mathbb{R}^{K},\] referred to as the Actor.
The environment then returns a reward <math>r_{t} \in \mathbb{R}</math> according to a reward function <math>r_{t} = g(s_{t})</math>. The updated state <math>s_{t+1}</math> at the next time step t+1 is governed by a certain but unknown state transition function <math>s_{t+1} = f(s_{t}, a_{t})</math> of the environment.
A trace consisting of a sequence of triplets can thus be observed: \[\tau = \{\ldots, (s_{t}, a_{t}, r_{t}) , (s_{t+1}, a_{t+1}, r_{t+1}) , \ldots \}\]
Meanwhile, <math>V(s_{t}) \in \mathbb{R}</math> denotes the expected accumulated future reward given state <math>s_{t}</math> (referred to as the Critic). The policy function <math>\pi(\cdot)</math> and the value function <math>V(\cdot)</math> are jointly modeled by a neural network. Rewriting these as <math>\pi(\cdot|s_{t};\theta)</math> and <math>V(s_{t};{\theta}')</math> with parameters <math>\theta</math> and <math>{\theta}'</math> respectively, the parameters are learned over the trace <math>\tau</math> by simultaneous stochastic policy gradient and value function regression:


(1) <math>\theta\leftarrow\theta+\alpha(R_t-V(s_t))\nabla_\theta \log\pi(a_t|s_t)+\beta\nabla_\theta H(\pi(\cdot|s_t))</math>

(2) <math>\theta'\leftarrow\theta'-\alpha\nabla_{\theta'}\tfrac{1}{2}(R_t-V(s_t))^2</math>


where <math>R_{t} = \sum_{{t}'=t}^{t+T-1} \gamma^{{t}'-t}r_{{t}'}</math> is a discounted sum of future rewards over up to <math>T</math> time steps with factor <math>0 < \gamma \leq 1</math>, <math>\alpha</math> is the learning rate, <math>H(\cdot)</math> is an entropy regularizer, and <math>\beta</math> is the regularization factor.

This is a multi-threaded training process in which each thread maintains an independent environment-agent interaction. Nevertheless, the network parameters are shared among threads and are updated asynchronously every T time steps in a lock-free manner, using Eqs. (1) and (2) in each thread. This multi-threaded training is known to be fast and stable, and it also improves generalization (Mnih et al., 2016).
 
[[File:A3C High Level Diagram.png|thumb|center|High-level diagram that depicts the A3C algorithm across <math>n</math> environments]]
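For concreteness, the following is a minimal single-worker sketch of the updates in Eqs. (1) and (2) written with PyTorch. The trace format, hyperparameter values, and gradient-clipping choice are assumptions; a full A3C implementation would apply these gradients asynchronously to shared parameters across threads.

<pre>
import torch
import torch.nn.functional as F

def a3c_update(model, optimizer, trace, bootstrap_value,
               gamma=0.99, beta=0.01, value_coef=0.5):
    """One n-step actor-critic update over a worker's trace.

    `trace` is a list of (logits, value, action, reward) for each step t, where
    logits (size K) and value come from the ConvNet-LSTM, action is the sampled
    action index, and reward is r_t. `bootstrap_value` is V(s_{t+T}) as a plain
    float (0.0 if the episode terminated).
    """
    R = bootstrap_value
    policy_loss, value_loss, entropy = 0.0, 0.0, 0.0
    for logits, value, action, reward in reversed(trace):
        R = reward + gamma * R                        # discounted return R_t
        advantage = R - value                         # (R_t - V(s_t))
        log_probs = F.log_softmax(logits, dim=-1)
        probs = log_probs.exp()
        policy_loss += -log_probs[action] * advantage.detach()   # Eq. (1), ascent direction
        entropy += -(probs * log_probs).sum()                     # H(pi(.|s_t)) regularizer
        value_loss += 0.5 * advantage.pow(2)                       # Eq. (2)

    loss = policy_loss + value_coef * value_loss - beta * entropy
    optimizer.zero_grad()
    loss.backward()                                   # in A3C, gradients go to shared params
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=40.0)
    optimizer.step()
</pre>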
 
==Network Architecture==
The tracker is a ConvNet-LSTM neural network as shown in Fig. 2, with the architecture specification given in the following table. The FC6 and FC1 layers correspond to the 6-action policy <math>\pi(\cdot|s_{t})</math> and the value <math>V(s_{t})</math>, respectively. The screen is resized to an 84 × 84 × 3 RGB image as the network input.
  [[File:network-architecture.PNG|500px|center]]
[[File:table.PNG|500px|center]]
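As an illustration only, below is a PyTorch sketch of such a ConvNet-LSTM actor-critic tracker. The convolutional and LSTM layer sizes are assumed (the exact specification is in the table image above); only the 84 × 84 × 3 input and the FC6/FC1 policy and value heads follow the text.

<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNetLSTMTracker(nn.Module):
    """ConvNet encoder + LSTM + policy (FC6) and value (FC1) heads (sizes assumed)."""

    def __init__(self, num_actions=6, hidden_size=256):
        super().__init__()
        # Convolutional encoder over the 84x84x3 input frame (filter sizes assumed).
        self.conv1 = nn.Conv2d(3, 16, kernel_size=8, stride=4)   # -> 16 x 20 x 20
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)  # -> 32 x 9 x 9
        self.fc = nn.Linear(32 * 9 * 9, hidden_size)
        # LSTM aggregates information over the sequence of frames.
        self.lstm = nn.LSTMCell(hidden_size, hidden_size)
        # FC6: 6-action policy logits; FC1: scalar state value.
        self.policy_head = nn.Linear(hidden_size, num_actions)
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, frame, lstm_state):
        # frame: (batch, 3, 84, 84) RGB; lstm_state: (h, c) from the previous step.
        x = F.relu(self.conv1(frame))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc(x.flatten(start_dim=1)))
        h, c = self.lstm(x, lstm_state)
        return self.policy_head(h), self.value_head(h), (h, c)

# Usage sketch:
# net = ConvNetLSTMTracker()
# h = c = torch.zeros(1, 256)
# logits, value, (h, c) = net(torch.zeros(1, 3, 84, 84), (h, c))
</pre>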
==Reward Function==
The reward function utilizes a two-dimensional local coordinate system S. The x-axis points from the agent's left shoulder to its right shoulder, and the y-axis is perpendicular to the x-axis and points to the agent's front. The origin is where the agent is, and the system S is parallel to the floor. The object has local coordinates (x, y) and orientation a with regard to the system S.
The reward function is defined as follows.
  [[File:reward_function.PNG|300px|center]]
where A > 0, c > 0, d > 0 and λ > 0 are tuning parameters. The reward equation states that the maximum reward A is achieved when the object stands perfectly in front of the agent at distance d and exhibits no rotation.
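The exact reward expression appears only in the figure above, so the sketch below encodes a plausible form that matches the description: the reward peaks at A when the object sits at local coordinates (0, d) with zero rotation, and decreases with position and rotation error. The specific functional form and parameter values are assumptions, not the published formula.

<pre>
import math

def tracking_reward(x, y, a, A=1.0, c=1.0, d=2.0, lam=0.5):
    """Hypothetical reward consistent with the text (NOT the exact published formula):
    maximal reward A when the object is at (0, d) in the agent's local frame with
    zero rotation a, decreasing as the object deviates from that pose.
    """
    position_error = math.sqrt(x ** 2 + (y - d) ** 2)   # distance from the ideal point (0, d)
    return A - (position_error + lam * abs(a)) / c

# The maximum A is attained at x = 0, y = d, a = 0:
assert tracking_reward(0.0, 2.0, 0.0) == 1.0
</pre>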
==Environment Augmentation==
To make the tracker generalize well, an environment augmentation technique is proposed for both virtual environments.

For ViZDoom, (x, y, a) define the system state. For augmentation, the initial system state is perturbed N times by editing the map with the ACS script (Kempka et al., 2016), yielding a set of environments with varied initial positions and orientations <math>\{x_{i},y_{i},a_{i}\}_{i=1}^{N}</math>. Flipping the screen frame left-right (and, accordingly, the left-right actions) is also allowed. As a result, 2N environments are obtained from one environment. During A3C training, one of the 2N environments is randomly sampled at the beginning of every episode, which improves the generalization ability of the tracker.

For UE, an environment with a character/target following a fixed path is constructed. To augment the environment, random background objects are chosen and made invisible. In addition, every episode starts from the position where the agent failed in the last episode. This makes the environment and starting point different from episode to episode, so the variations of the environment during training are augmented.
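The following sketch illustrates the sampling logic described above for ViZDoom: N perturbed initial states plus their left-right flips give 2N environment variants, one of which is drawn at the start of every episode. The configuration dictionaries and flip handling are illustrative assumptions rather than the authors' ACS/ViZDoom scripts.

<pre>
import random

def make_augmented_envs(base_states, n=21):
    """Given N perturbed initial states (x, y, a), build 2N environment configs:
    each state once as-is and once with a left-right flip of the screen/actions."""
    configs = []
    for (x, y, a) in base_states[:n]:
        configs.append({"init_state": (x, y, a), "flip_lr": False})
        configs.append({"init_state": (x, y, a), "flip_lr": True})
    return configs

def sample_episode_env(configs):
    """At the beginning of every episode, randomly pick one of the 2N variants."""
    cfg = random.choice(configs)
    # When flip_lr is True, the observation is mirrored and the left/right
    # actions are swapped before being sent to the game engine.
    return cfg

# Example with hypothetical perturbed initial states:
base = [(random.uniform(-1, 1), random.uniform(1, 3), random.uniform(-30, 30))
        for _ in range(21)]
envs = make_augmented_envs(base)      # 42 variants, as in the experiments
episode_cfg = sample_episode_env(envs)
</pre>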
 
=Experimental Settings=
==Environment Setup==
A set of environments is produced for both training and testing. For ViZDoom, a training map as in Fig. 4 (left column) is adopted. This map is then augmented with N = 21, leading to 42 environments that can be sampled from during training. For testing, 9 maps are made, some of which are shown in Fig. 4 (middle and right columns). In all maps, the path of the target is pre-specified, indicated by the blue lines. However, it is worth noting that the object does not strictly follow the planned path: it sometimes randomly moves in a "zig-zag" way during the course, which is a built-in game engine behavior. This poses an additional difficulty for the tracking problem.

For UE, an environment named Square with random invisible background objects is generated, with a target named Stefani walking along a fixed path for training. For testing, another four environments named Square1StefaniPath1 (S1SP1), Square1MalcomPath1 (S1MP1), Square1StefaniPath2 (S1SP2), and Square2MalcomPath2 (S2MP2) are made. As shown in Fig. 5, Square1 and Square2 are two different maps, Stefani and Malcom are two characters/targets, and Path1 and Path2 are different paths. Note that the training environment Square is generated by hiding some background objects in Square1.

For both ViZDoom and UE, an episode is terminated when either the accumulated reward drops below a threshold or the episode length reaches a maximum number. In these experiments, the reward threshold is set to -450 and the maximum length to 3000, respectively.
==Metric==
Two metrics are employed for the experiments: Accumulated Reward (AR) and Episode Length (EL). AR is analogous to Precision in the conventional tracking literature; an AR that is too small leads to termination of the episode because it essentially means a failure of tracking. EL roughly measures the duration of good tracking and is analogous to the metric of Successfully Tracked Frames in conventional tracking applications. The theoretical maximum for both AR and EL is 3000 when A = 1.0 in the reward function (because of the termination criterion).
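A minimal sketch of how AR and EL could be computed for one episode under the termination rule above (reward threshold -450, maximum length 3000); the reward stream is a placeholder for the per-step rewards returned by the environment.

<pre>
def evaluate_episode(reward_stream, reward_threshold=-450.0, max_length=3000):
    """Accumulate rewards until the termination criterion fires; return (AR, EL)."""
    accumulated_reward = 0.0
    episode_length = 0
    for r in reward_stream:                      # r_t returned by the environment each step
        accumulated_reward += r
        episode_length += 1
        if accumulated_reward < reward_threshold or episode_length >= max_length:
            break
    return accumulated_reward, episode_length

# With A = 1.0 and perfect tracking (r_t = 1 every step), both metrics hit the cap:
ar, el = evaluate_episode(iter(lambda: 1.0, None))   # endless stream of 1.0 rewards
assert (ar, el) == (3000.0, 3000)
</pre>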


=Results=
Two training protocols were followed, namely RandomizedEnv (with the augmentation technique) and SingleEnv (without the augmentation technique). However, only the results for RandomizedEnv are reported in detail in the paper.
There is only one table specifying the result of SingleEnv training, which shows that it performs worse than RandomizedEnv training. Compared to RandomizedEnv, SingleEnv does not exploit the capacity of the network as well, and the variability in the test results is very high for the non-augmented training case.
[[File:table1.PNG|400px|center]]
The results for the testing environments are reported in Tab. 2. These are 8 more challenging test environments that present different target appearances, different backgrounds, more varied paths, and distracting targets compared to the training environment.
[[File:msm_table2.PNG|400px|center]]
Following are the findings from the testing results:
To compare with traditional trackers, simulated active trackers are built by pairing a passive tracker with a camera control module. The camera control module is implemented such that in the first frame, a manual bounding box must be given to indicate the object to be tracked. For each subsequent frame, the passive tracker predicts a bounding box, which is passed to the camera control module. The two subsequent bounding boxes are compared as per the algorithm, and an action decision is made.
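The text does not spell out the comparison rule used by this camera control module, so the sketch below is only a plausible illustration of a bounding-box-driven controller: it turns toward the side where the box center lies and moves forward when the box looks small. The thresholds and the rule itself are hypothetical.

<pre>
def camera_action(bbox, frame_width, center_tol=0.1, near_area_ratio=0.05):
    """Map the passive tracker's bounding box to one of the 6 camera actions.

    bbox = (x, y, w, h) in pixels. The thresholds and the decision rule are
    assumptions, not the algorithm used in the paper's comparison.
    """
    x, y, w, h = bbox
    center_offset = (x + w / 2.0) / frame_width - 0.5      # <0: target left, >0: right
    is_far = (w * h) / float(frame_width * frame_width) < near_area_ratio

    if abs(center_offset) <= center_tol:
        return "move-forward" if is_far else "no-op"
    side = "turn-right" if center_offset > 0 else "turn-left"
    return side + "-and-move-forward" if is_far else side

# Example: a small box on the right third of a 640-pixel-wide frame.
print(camera_action((480, 200, 40, 80), frame_width=640))   # turn-right-and-move-forward
</pre>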
The results show that the proposed solution outperforms the simulated active trackers, which lose their targets quickly. The Meanshift tracker works well when there is no camera shift between consecutive frames, while both the KCF and Correlation trackers seem incapable of handling such large camera shifts, so they do not work as well as in the passive tracking case. The MIL tracker works reasonably in the active case, but it easily drifts when the object turns suddenly.
Testing in the UE environment is tabulated in Table 5. Four different environments are tested, with the simulated active tracker based on the long-term TLD tracker.
  [[File:table5.PNG|400px|center]]
1. Comparison between S1SP1 and S1MP1 shows that the tracker generalizes well even though the model is trained only with the target Stefani, revealing that it does not overfit to a specialized appearance.
2. The active tracker performs well when the path is changed (S1SP1 versus S1SP2), demonstrating that it does not act by memorizing a specialized path.
3. When the map, target, and path are changed at the same time (S2MP2), although the tracker cannot seize the target as accurately as in the previous environments (the AR value drops), it can still track objects robustly (a comparable EL value to the previous environments), proving its superior generalization potential.
4. In most cases, the proposed tracker outperforms the simulated active trackers, or achieves comparable results when it is not the best. The results of the simulated active trackers also suggest that it is difficult to tune a unified camera-control module for them, even when a long-term tracker is adopted (see the results of TLD).


Real-world active tracking: To test and evaluate the tracker in real-world scenarios, the network trained in the UE environment is tested on a few videos from the VOT dataset.


[[File:fig7.PNG|400px|center]]


Fig. 7 shows the output actions for two video clips, named Woman and Sphere, respectively. The horizontal axis indicates the position of the target in the image, with a positive (negative) value meaning that the target is in the right (left) part. The vertical axis indicates the size of the target, i.e., the area of the ground-truth bounding box. Green and red dots indicate turn-left/turn-left-and-move-forward and turn-right/turn-right-and-move-forward actions, respectively. Yellow dots represent the no-op action. As the figure shows: 1) When the target resides on the right (left) side, the tracker tends to turn right (left), trying to move the camera to "pull" the target to the center. 2) When the target size becomes bigger, which probably indicates that the tracker is too close to the target, the tracker outputs no-op actions more often, intending to stop and wait for the target to move farther away.
 
Video Link to the experimental results can be found below:
[https://youtu.be/C1Bn8WGtv0w Video Demonstration of the Results]
 
Supplementary Material for Further Experiments:
[http://proceedings.mlr.press/v80/luo18a/luo18a-supp.zip Additional PDF and Video]
 
Action saliency map: An input frame is fed into the tracker and forwarded through the network to produce the policy function, from which an action is sampled. The gradient of this action is then propagated backwards to the input layer, generating a saliency map. The saliency map reveals how the input image affects the tracker's action. Fig. 8 shows that the tracker indeed learns how to find the target, which helps explain the model's performance.
[[File:fig8.PNG|400px|center]]
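As an illustration, the sketch below shows a common way to compute such an action saliency map in PyTorch: backpropagate the selected action's score to the input frame and take per-pixel gradient magnitudes. The network here refers to the ConvNet-LSTM sketch from the architecture section and is an assumption, not the authors' code.

<pre>
import torch

def action_saliency(net, frame, lstm_state):
    """Gradient of the chosen action's logit w.r.t. the input frame (saliency map)."""
    frame = frame.clone().requires_grad_(True)       # (1, 3, 84, 84), gradients enabled
    logits, _value, _state = net(frame, lstm_state)
    action = logits.argmax(dim=-1).item()            # or sample from softmax(logits)
    logits[0, action].backward()                     # backprop the selected action's score
    # Max absolute gradient over the RGB channels gives a per-pixel saliency value.
    return frame.grad.abs().max(dim=1).values.squeeze(0)

# Usage with the (assumed) network from the architecture section:
# net = ConvNetLSTMTracker()
# h = c = torch.zeros(1, 256)
# saliency = action_saliency(net, torch.rand(1, 3, 84, 84), (h, c))   # (84, 84) map
</pre>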
 
=Conclusion=
In the paper, an end-to-end active tracker trained via deep reinforcement learning is proposed. Unlike conventional passive trackers, the proposed tracker is trained in simulators, saving the effort of human labeling and of trial-and-error in the real world. It shows good generalization to unseen environments, and its tracking ability can potentially transfer to real-world scenarios.
=Critique=
The paper presents a solution for active tracking using reinforcement learning. A ConvNet-LSTM network is adopted, and environment augmentation is proposed for training it. The tracker trained using environment augmentation performs better than the one trained without it, in both the ViZDoom and UE environments. The reward function looks intuitive for the task at hand, which is object tracking. The virtual environment ViZDoom, though used for training and testing, seems to have little or no generalization ability to real-world scenarios, and the maps in ViZDoom themselves are very simple. The comparisons presented for ViZDoom testing with changes in the environmental parameters look positive, but the relatively simple nature of the environment needs to be considered when looking at these results. Also, when the floor is replaced by the ceiling, the tracker performs worst in comparison to the other cases in the table, which seems to indicate that the floor and ceiling parameters are somewhat over-fitted in the model. The tracker trained in the UE environment is tested against simulated active trackers, and the results show that the proposed solution performs better. However, since those trackers are simulated using a camera control algorithm written for this specific comparison, further testing is required for benchmarking. The real-world challenges of intensity variation, camera details, and control signals, though beyond the scope of the current paper, still need to be considered when discussing the generalization ability of the model to real-world scenarios. For example, the current action space includes only six discrete actions, which are inadequate for deployment in the real world because the tracker cannot adapt to different moving speeds of the target. It is also questionable whether training the tracker in the UE simulator alone is sufficient for a successful real-world deployment; it would be better to randomize more aspects of the environment during training, including the texture of each mesh, the illumination condition of the scene, and the trajectory and speed of the target.
The results on the real-world videos are a positive sign for the generalization ability of the model in real-world settings. The overall approach presented in the paper is intuitive and the results look promising.
 
=Future Work=
The authors extended this work in several ways in a follow-up paper, where they deployed the approach on a real robot and enhanced the system to deal with the virtual-to-real gap [1]. Specifically: 1) more advanced environment augmentation techniques were proposed to boost environment diversity, improving transferability to the real world; 2) a more appropriate action space than in the conference paper was developed, and using a continuous action space for active tracking was investigated; and 3) a mapping from the neural network prediction to the robot control signal was established so as to successfully deliver end-to-end tracking.
 
=References=
[1] W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y. Wang. End-to-end active object tracking and its real-world deployment via reinforcement learning. [https://arxiv.org/pdf/1808.03405.pdf arXiv:1808.03405], 2018.
 
[2] Ross, David A, Lim, Jongwoo, Lin, Ruei-Sung, and Yang, Ming- Hsuan. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1-3):125–141, 2008.
 
[3] Babenko, Boris, Yang, Ming-Hsuan, and Belongie, Serge. Visual tracking with online multiple instance learning. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 983–990, 2009.
 
[4] Bolme, David S, Beveridge, J Ross, Draper, Bruce A, and Lui, Yui Man. Visual object tracking using adaptive correlation filters. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 2544–2550, 2010.
 
[5] Hare, Sam, Golodetz, Stuart, Saffari, Amir, Vineet, Vibhav, Cheng, Ming-Ming, Hicks, Stephen L, and Torr, Philip HS. Struck: Structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2096–2109, 2016.
 
[6] Kalal, Zdenek, Mikolajczyk, Krystian, and Matas, Jiri. Tracking- learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409–1422, 2012.
 
[7] Wang, Naiyan and Yeung, Dit-Yan. Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems, pp. 809–817, 2013.
 
[8] Wang, Lijun, Ouyang, Wanli, Wang, Xiaogang, and Lu, Huchuan. Stct: Sequentially training convolutional networks for visual tracking. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 1373–1381, 2016.
 
[9] Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Detect to track and track to detect." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

Latest revision as of 18:27, 16 December 2018

Introduction

Object tracking has been a hot topic in recent years. It involves localization of an object in continuous video frames given an initial annotation in the first frame. The process normally consists of the following steps.

  1. Taking an initial set of object detections.
  2. Creating and assigning a unique ID for each of the initial detections.
  3. Tracking those objects as they move around in the video frames, maintaining the assignment of unique IDs.

There are two types of object tracking.

  1. Passive tracking
  2. Active tracking

Passive tracking assumes that the object of interest is always in the image scene, meaning that there is no need for camera control during tracking. Although passive tracking is very useful and well-researched with existing works, it is not applicable in situations like tracking performed by a camera-mounted mobile robot or by a drone. On the other hand, active tracking involves two subtasks, including 1) Object Tracking and 2) Camera Control. It is difficult to jointly tune the pipeline between these two separate subtasks. Object Tracking may require human efforts for bounding box labeling. In addition, Camera Control is non-trivial, which can lead to many expensive trial-and-errors in the real world.

To address these challenges, this paper presents an end-to-end active tracking solution via deep reinforcement learning. More specifically, the ConvNet-LSTM network takes raw video frames as input and outputs camera movement actions. The virtual environment is used to simulate active tracking. In a virtual environment, an agent (i.e. the tracker) observes a state (a visual frame) from a first-person perspective and takes an action. Then, the environment returns the updated state (next visual frame). A3C, a modern Reinforcement Learning algorithm, is adopted to train the agent, where a customized reward function is designed to encourage the agent to be closely following the object. Environment augmentation technique is used to boost the tracker’s generalization ability. The tracker trained in the virtual environment is then tested on a real-world video dataset to assess the generalizability of the model. A video of the first version of this paper is available here[1].

Intuition

As in the case of the state of the art models, if the action module and the object tracking module are completely different, it is extremely difficult to train one or the other as it is impossible to know which is causing the error that is being observed at the end of the episode. The function of both these modules are the same at a high level as both are aiming for efficient navigation. So it makes sense to have a joint module that consists of both the observation and the action taking submodules. Now we can train the entire system together as the error needs to be propagated to the whole system. This is in line with the common practice in Deep Reinforcement Learning where the CNNs used to extract features in the case of Atari games are combined with the Q networks (in the case of DQN). The training of these CNN happens concurrently with the Q feedforward networks where the error function is the difference between the observed Q value and the target Q values.

Related Work

In the domain of object tracking, there are both active and passive approaches. The below summarize the advance passive object tracking approaches:

1) Subspace learning was adopted to update the appearance model of an object.

Formerly, object tracking algorithms employ a fixed appearance model. Consequently, they often perform poorly when the target object changes in appearance or illumination. To overcome this problem, Ross et al. 2008 introduce a novel tracking method that incrementally adapts the appearance model according to new observations made during tracking [2].

2) Multiple instance learning was employed to track an object.

Many researchers have shown that a tracking algorithm can achieve better performance by employing adaptive appearance models capable of separating an object from its background. However, the discriminative classifier in those models is often difficult to update. So, Babenko et al. 2009 introduce a novel algorithm that updates its appearance model using a “bag” of positive and negative examples. Subsequently, they show that tracking algorithms using weaker classifiers can still obtain superior performance [3].

3) Correlation filter based object tracking has achieved success in real-time object tracking.

Correlation filter based object tracking algorithms attempt to “model the appearance of an object using filters”. At each frame, a small tracking window representing the target object is produced, and the tracker will correlate the windows over the image sequences, thus achieving object tracking. Bolme et al. 2010 validate this concept by creating a novel object tracking algorithm using an adaptive correlation filter called Minimum Output Sum of Squared Error (MOSSE) filter [4].

4) Structured Output predicted was used to constrain object tracking and avoiding converting positions to labels of training samples.

Hare et al. 2016 argue the “sliding-window” approach use by popular object tracking algorithms is flawed because “the objective of the classifier (predicting labels for sliding-windows) is decoupled from the objective of the tracker (estimating object position).” Instead, they introduce a novel algorithm that uses “a kernelized structured output support vector machine (SVM) to avoid the need for intermediate classification”. Subsequently, they show the approach outperforms traditional trackers in various benchmarks [5].

5) Tracking, learning, and Detection were integrated into one framework for long-term tracking, where a detection module was used to re-initialize the tracker once a missing object reappears.

Long-Term Tracking is the task to recognize and track an object as it “moves in and out of a camera’s field of view”. This task is made difficult by problems such as an object reappearing into the scene and changing its appearance, scale, or illumination. Kalal et al. 2012 proposed a unified tracking framework (TLD) that accomplishes long-term tracking by “decomposing the task into tracking, learning, and detection”. Specifically, “the tracker follows an object from frame-to-frame; the detector localizes the object’s appearances; and, the learner improves the detector by learning from errors.” Altogether, the TLD framework outperforms previous state-of-arts tracking approaches [6].

6) Deep learning models like stacked autoencoder have been used to learn good representations for object tracking.

In recent year, Deep Learning approaches are gaining prominence in the field of object tracking. For example, Wang et al. 2013 obtain outstanding results using a deep-learning based algorithm that combines offline feature extraction and online tracking using stacked denoising autoencoders. Whereas, Wang et al. 2016 introduced a sequential training convolutional network that can efficiently transfer offline learned features for online visual tracking applications.

7) Pixel-level image classification.

Object identification is essentially pixel level classification, where each pixel in the image is given a label. It is a more general form of image classification. In recent years, CNN has advanced many benchmarks in this field, and some AutoML methods, such as Neural Architecture Search has been applied in this field and achieved state of the art.

For the active approaches, camera control and object tracking were considered as separate components. This configuration makes it much complex and therefore difficult to handle for current algorithms as the number of degrees of freedom usually increment exponentially depending on the specific context or experiment and also require a precise temporal coordination between the object detection and tracking tasks [9]. In comparison with passive approaches, fewer methods have been developed for active methods considering that several real-world applications require both object tracking and control to be run simultaneously such as managing control for an autonomous vehicle. This paper tackles object tracking and camera control simultaneously in an end to end manner and is easy to tune.

In the domain of domain of deep reinforcement learning, recent algorithms have achieved advanced gameplay in games like GO and Atari games. They have also been used in computer vision tasks like object localization, region proposal, and visual tracking. All advancements pertain to passive tracking but this paper focusses on active tracking using Deep RL, which has never been tried before.

Approach

Virtual tracking scenes are generated for both training and testing. An Asynchronous Actor-Critic Agents (A3C) model was used to train the tracker. For efficient training, data augmentation techniques and a customized reward function were used. An RGB screen frame of the first-person perspective was chosen as the state for the study. The tracker observes a visual state and takes one action [math]\displaystyle{ a_t }[/math] from the following set of 6 actions.

\[A = \{\text{turn-left}, \text{turn-right}, \text{turn-left-and-move-forward},\\ \text{turn-right-and-move-forward}, \text{move-forward}, \text{no-op}\}\]

The action is processed by the environment, which returns to the agent the current reward as well as the updated screen frame [math]\displaystyle{ (r_t, s_{t+1}) }[/math].

Tracking Scenarios

It is impossible to train the desired end-to-end active tracker in real-world scenarios. Therefore, The following two Virtual environment engines are used for the simulated training.

ViZDoom

ViZDoom[2] (Kempka et al., 2016; ViZ) is an RL research platform based on a 3D FPS video game called Doom. In ViZDoom, the game engine corresponds to the environment, while the video game player corresponds to the agent. The agent receives from the environment a state and a reward at each time step. In this study, customized ViZDoom maps are used. (see Fig. 4) composed of an object (a monster) and background (ceiling, floor, and wall). The monster walks along a pre-specified path programmed by the ACS script (Kempka et al., 2016), and the goal is to train the agent, i.e., the tracker, to follow closely the object.

Unreal Engine

Though convenient for research, ViZDoom does not provide realistic scenarios. To this end, Unreal Engine (UE) is adopted to construct nearly real-world environments. UE is a popular game engine and has a broad influence in the game industry. It provides realistic scenarios which can mimic real-world scenes. UnrealCV (Qiu et al., 2017) is employed in this study, which provides convenient APIs, along with a wrapper (Zhong et al., 2017) compatible with OpenAI Gym (Brockman et al., 2016), for interactions between RL algorithms and the environments constructed based on UE.

A3C Algorithm

This paper employs the Asynchronous Actor-Critic Agents (A3C) algorithm for training the tracker. At time step t, [math]\displaystyle{ s_{t} }[/math] denotes the observed state corresponding to the raw RGB frame. The action set is denoted by A of size K = |A|. An action, [math]\displaystyle{ a_{t} }[/math] ∈ A, is drawn from a policy function distribution: \[a_{t}\sim \pi\left ( . | s_{t} \right ) \in \mathbb{R}^{k} \] This is referred to as actor. The environment then returns a reward [math]\displaystyle{ r_{t} \in \mathbb{R} }[/math] , according to a reward function [math]\displaystyle{ r_{t} = g(s_{t}) }[/math] . The updated state [math]\displaystyle{ s_{t+1} }[/math] at next time step t+1 is subject to a certain but unknown state transition function [math]\displaystyle{ s_{t+1} = f(s_{t}, a_{t}) }[/math], governed by the environment. Trace consisting of a sequence of triplets can be observed. \[\tau = \{\ldots, (s_{t}, a_{t}, r_{t}) , (s_{t+1}, a_{t+1}, r_{t+1}) , \ldots \}\] Meanwhile, [math]\displaystyle{ V(s_{t}) \in \mathbb{R} }[/math] denotes the expected accumulated reward in the future given state [math]\displaystyle{ s_{t} }[/math] (referred to as Critic). The policy function [math]\displaystyle{ \pi(.) }[/math] and the value function [math]\displaystyle{ V (·) }[/math] are then jointly modeled by a neural network. Rewriting these as [math]\displaystyle{ \pi(.|s_{t};\theta) }[/math] and [math]\displaystyle{ V(s_{t};{\theta}') }[/math] with parameters [math]\displaystyle{ \theta }[/math] and [math]\displaystyle{ {\theta}' }[/math] respectively. The parameters are learned over trace [math]\displaystyle{ \tau }[/math] by simultaneous stochastic policy gradient and value function regression:


(1) [math]\displaystyle{ \theta \leftarrow \theta + \alpha \left( R_t - V(s_t) \right) \nabla_{\theta} \log \pi(a_t|s_t) + \beta \nabla_{\theta} H\left( \pi(\cdot|s_t) \right) }[/math]

(2) [math]\displaystyle{ \theta' \leftarrow \theta' - \alpha \nabla_{\theta'} \tfrac{1}{2} \left( R_t - V(s_t) \right)^2 }[/math]


where [math]\displaystyle{ R_{t} = \sum_{{t}'=t}^{t+T-1} \gamma^{{t}'-t}r_{{t}'} }[/math] is a discounted sum of future rewards over up to [math]\displaystyle{ T }[/math] time steps with a discount factor [math]\displaystyle{ 0 \lt \gamma \leq 1 }[/math], [math]\displaystyle{ \alpha }[/math] is the learning rate, [math]\displaystyle{ H(\cdot) }[/math] is an entropy regularizer, and [math]\displaystyle{ \beta }[/math] is the regularization factor.

This is a multi-threaded training process where each thread maintains an independent environment-agent interaction. Nevertheless, the network parameters are shared among threads and updated asynchronously every T time steps, in a lock-free manner in each thread, using Eqs. (1) and (2). This multi-threaded training is known to be fast and stable, and it also improves generalization (Mnih et al., 2016).
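As a concrete (non-authoritative) illustration, the per-step loss that corresponds to Eqs. (1) and (2) can be sketched in PyTorch as follows; the asynchronous, multi-threaded parameter sharing is omitted, and the `policy_logits`/`value` inputs are assumed to come from the shared ConvNet-LSTM.

<pre>
import torch
import torch.nn.functional as F

def a3c_loss(policy_logits, value, action, R_t, beta=0.01):
    """One-step A3C loss corresponding to Eqs. (1) and (2).

    policy_logits: tensor of shape (K,) -- unnormalized scores over the K actions
    value:         scalar tensor V(s_t)
    action:        int, the action a_t actually taken
    R_t:           scalar tensor, discounted return over the next T steps
    """
    log_probs = F.log_softmax(policy_logits, dim=-1)
    probs = log_probs.exp()
    advantage = (R_t - value).detach()            # treated as a constant in Eq. (1)

    policy_loss = -advantage * log_probs[action]  # stochastic policy gradient term
    entropy = -(probs * log_probs).sum()          # H(pi(.|s_t)) regularizer
    value_loss = 0.5 * (R_t - value).pow(2)       # value-function regression, Eq. (2)

    return policy_loss - beta * entropy + value_loss
</pre>

Minimizing this loss with a learning rate [math]\displaystyle{ \alpha }[/math] performs the ascent of Eq. (1) and the descent of Eq. (2).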

(Figure: high-level diagram depicting the A3C algorithm across [math]\displaystyle{ n }[/math] environments.)

=Network Architecture=

The tracker is a ConvNet-LSTM neural network as shown in Fig. 2, with the architecture specification given in the corresponding table. The FC6 and FC1 layers correspond to the 6-action policy [math]\displaystyle{ \pi(\cdot|s_{t}) }[/math] and the value [math]\displaystyle{ V(s_{t}) }[/math], respectively. The screen is resized to an 84 × 84 × 3 RGB image as the network input.
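A minimal PyTorch sketch of such a ConvNet-LSTM tracker is shown below; since the exact specification table is not reproduced here, the layer sizes are illustrative assumptions rather than the authors' configuration.

<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNetLSTMTracker(nn.Module):
    """Sketch of the ConvNet-LSTM tracker; layer sizes are illustrative."""
    def __init__(self, num_actions=6, hidden=256):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=8, stride=4)   # 84x84x3 input
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)
        self.fc = nn.Linear(32 * 9 * 9, hidden)
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.fc6 = nn.Linear(hidden, num_actions)  # policy head  pi(.|s_t)
        self.fc1 = nn.Linear(hidden, 1)            # value head   V(s_t)

    def forward(self, frame, lstm_state=None):
        x = F.relu(self.conv1(frame))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc(x.flatten(start_dim=1)))
        h, c = self.lstm(x, lstm_state)
        return F.softmax(self.fc6(h), dim=-1), self.fc1(h), (h, c)
</pre>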

=Reward Function=

The reward function is defined in a two-dimensional local coordinate system S, which is parallel to the floor. The x-axis points from the agent's left shoulder to its right shoulder, the y-axis is perpendicular to the x-axis and points to the agent's front, and the origin is at the agent's location. The object's local coordinates (x, y) and orientation a are expressed with respect to system S. The reward function is defined as follows.
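The equation itself is not reproduced in this summary; judging from the parameters described below and the original paper (Luo et al., 2018), it has the form

\[ r = A - \left( \frac{\sqrt{x^{2} + (y - d)^{2}}}{c} + \lambda \left| a \right| \right) \]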

where A > 0, c > 0, d > 0, and λ > 0 are tuning parameters. The reward equation states that the maximum reward A is achieved when the object stands perfectly in front of the agent at distance d and exhibits no rotation.
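For illustration, a one-line Python sketch of this reward, using the reconstructed form above (the default parameter values are placeholders, not the authors' settings):

<pre>
import math

def reward(x, y, a, A=1.0, c=1.0, d=1.0, lam=1.0):
    """Reward sketch: (x, y) is the object's position and a its orientation in
    the agent-centric system S; A, c, d, lam are tuning parameters."""
    return A - (math.sqrt(x ** 2 + (y - d) ** 2) / c + lam * abs(a))
</pre>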

=Environment Augmentation=

To make the tracker generalize well, an environment augmentation technique is proposed for both virtual environments.

For ViZDoom, (x, y, a) define the initial system state. For augmentation, this initial state is perturbed N times by editing the map with the ACS script (Kempka et al., 2016), yielding a set of environments with varied initial positions and orientations [math]\displaystyle{ \{x_{i},y_{i},a_{i}\}_{i=1}^{N} }[/math]. In addition, the screen frame may be flipped left-right (with the left and right actions swapped accordingly). As a result, 2N environments are obtained out of one environment. During A3C training, one of the 2N environments is randomly sampled at the beginning of every episode, which improves the generalization ability of the tracker.
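A rough sketch of the flip half of this augmentation is below; the action-index bookkeeping is an assumption for illustration, not the authors' code.

<pre>
import random
import numpy as np

# Mirror action indices: left <-> right, left-forward <-> right-forward,
# while move-forward and no-op map to themselves (indices follow the set A above).
FLIP_ACTION = {0: 1, 1: 0, 2: 3, 3: 2, 4: 4, 5: 5}

def sample_environment(perturbed_envs):
    """Pick one of the 2N variants: N perturbed initial states x {flip, no-flip}."""
    env = random.choice(perturbed_envs)   # one of the N perturbed ViZDoom maps
    flipped = random.random() < 0.5
    return env, flipped

def apply_flip(frame, action_index, flipped):
    """Mirror the screen frame left-right and swap the left/right actions."""
    if flipped:
        frame = np.fliplr(frame)
        action_index = FLIP_ACTION[action_index]
    return frame, action_index
</pre>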

For UE, an environment with a character/target following a fixed path is constructed. To augment the environment, random background objects are chosen and made invisible. Additionally, every episode starts from the position where the agent failed in the last episode. This makes the environment and starting point differ from episode to episode, augmenting the variations of the environment during training.

=Experimental Settings=

==Environment Setup==

A set of environments is produced for both training and testing. For ViZDoom, a training map as in Fig. 4 (left column) is adopted. This map is augmented with N = 21, leading to 42 environments that can be sampled from during training. For testing, 9 maps are made, some of which are shown in Fig. 4 (middle and right columns). In all maps, the path of the target is pre-specified, indicated by the blue lines. However, the object does not strictly follow the planned path; it sometimes randomly moves in a "zig-zag" way during the course, which is a built-in behavior of the game engine. This poses an additional difficulty for the tracking problem.

For UE, an environment named Square, with randomly invisible background objects and a target named Stefani walking along a fixed path, is generated for training. For testing, four more environments are made, named Square1StefaniPath1 (S1SP1), Square1MalcomPath1 (S1MP1), Square1StefaniPath2 (S1SP2), and Square2MalcomPath2 (S2MP2). As shown in Fig. 5, Square1 and Square2 are two different maps, Stefani and Malcom are two characters/targets, and Path1 and Path2 are different paths. Note that the training environment Square is generated by hiding some background objects in Square1.

For both ViZDoom and UE, an episode is terminated when either the accumulated reward drops below a threshold or the episode length reaches a maximum number. In these experiments, the reward threshold is set to -450 and the maximum length to 3000.

==Metric==

Two metrics are employed for the experiments: Accumulated Reward (AR) and Episode Length (EL). AR is analogous to Precision in the conventional tracking literature; an AR that is too small leads to termination of the episode, because it essentially means a failure of tracking. EL roughly measures the duration of good tracking and is analogous to the metric Successfully Tracked Frames in conventional tracking applications. The theoretical maximum for both AR and EL is 3000 when letting A = 1.0 in the reward function (because of the termination criterion).
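Under the termination rule above (reward threshold -450, maximum length 3000), both metrics fall out of a single rollout. A sketch, assuming the simple (state, reward, done) step interface of the dummy environment sketched earlier and a hypothetical `policy` callable:

<pre>
def run_episode(env, policy, reward_threshold=-450.0, max_length=3000):
    """Roll out one episode and return Accumulated Reward (AR) and Episode Length (EL)."""
    state, done = env.reset(), False
    ar, el = 0.0, 0
    while not done and el < max_length and ar > reward_threshold:
        state, r_t, done = env.step(policy(state))
        ar += r_t
        el += 1
    return ar, el
</pre>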

=Results=

Two training protocols were followed, namely RandomizedEnv (with augmentation) and SingleEnv (without the augmentation technique). However, only the results for RandomizedEnv are reported in detail in the paper; a single table for SingleEnv training shows that it performs worse than RandomizedEnv training. Compared to RandomizedEnv, SingleEnv does not exploit the capacity of the network as well, and the variability in the test results is very high for the non-augmented training case.

The results on the testing environments are reported in Tab. 2. These 8 test environments are more challenging, presenting different target appearances, different backgrounds, more varied paths, and distracting targets compared to the training environment.

Following are the findings from the testing results:
<ol>
<li> The tracker generalizes well when the target appearance changes (Zombie, Cacodemon). </li>
<li> The tracker is insensitive to background variations such as changing the ceiling and floor (FloorCeiling) or placing additional walls in the map (Corridor). </li>
<li> The tracker does not lose the target even when the target takes several sharp turns (SharpTurn). Note that in conventional tracking, the target is commonly assumed to move smoothly. </li>
<li> The tracker is insensitive to a distracting object (Noise1), even when the "bait" is very close to the path (Noise2). </li>
</ol>

The proposed tracker is compared against several conventional trackers equipped with a PID-like camera-control module to simulate active tracking. The results are displayed in Tab. 3.

The camera-control module is implemented such that in the first frame, a manual bounding box must be given to indicate the object to be tracked. For each subsequent frame, the passive tracker predicts a bounding box, which is passed to the camera-control module; the module compares the two consecutive bounding boxes and decides on an action. The results show that the proposed solution outperforms these simulated active trackers, which lose their targets quickly. The MeanShift tracker works well only when there is no camera shift between consecutive frames. Neither the KCF nor the Correlation tracker can handle such large camera shifts, so they do not work as well as they do in passive tracking. The MIL tracker works reasonably in the active case, but it easily drifts when the object turns suddenly.
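The exact rule used for the camera-control module is not given here; a plausible sketch of this kind of bounding-box-to-action mapping is below, with all thresholds being illustrative assumptions rather than the paper's settings.

<pre>
def camera_action(bbox, frame_width, center_tol=0.2, size_ratio=1.2, ref_area=None):
    """Hypothetical camera-control rule: compare the predicted bounding box with
    the frame center and a reference size, and map the offset to a discrete action.

    bbox: (x, y, w, h) predicted by the passive tracker."""
    x, y, w, h = bbox
    cx = x + w / 2.0
    offset = (cx - frame_width / 2.0) / frame_width   # horizontal offset in [-0.5, 0.5]
    too_far = ref_area is not None and w * h < ref_area / size_ratio

    if offset < -center_tol:
        return "turn-left-and-move-forward" if too_far else "turn-left"
    if offset > center_tol:
        return "turn-right-and-move-forward" if too_far else "turn-right"
    return "move-forward" if too_far else "no-op"
</pre>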

Results of testing in the UE environments are tabulated in Tab. 5: four different environments are tested, and the comparisons include a simulated active tracker based on the long-term TLD tracker.

<ol>
<li> The comparison between S1SP1 and S1MP1 shows that the tracker generalizes well even though the model is trained with the target Stefani, revealing that it does not overfit to a specialized appearance. </li>
<li> The active tracker performs well when the path is changed (S1SP1 versus S1SP2), demonstrating that it does not act by memorizing a specific path. </li>
<li> When the map, target, and path are all changed at the same time (S2MP2), the tracker cannot seize the target as accurately as in the previous environments (the AR value drops), but it can still track robustly (an EL value comparable to the previous environments), showing its generalization potential. </li>
<li> In most cases, the proposed tracker outperforms the simulated active trackers, or achieves comparable results when it is not the best. The results of the simulated active trackers also suggest that it is difficult to tune a unified camera-control module for them, even when a long-term tracker is adopted (see the TLD results). </li>
</ol>

Real-world active tracking: to test and evaluate the tracker in real-world scenarios, the network trained in the UE environment is applied to a few videos from the VOT dataset.

Fig. 7 shows the output actions for two video clips, named Woman and Sphere, respectively. The horizontal axis indicates the position of the target in the image, with a positive (negative) value meaning the target is in the right (left) part of the image. The vertical axis indicates the size of the target, i.e., the area of the ground-truth bounding box. Green and red dots indicate turn-left/turn-left-and-move-forward and turn-right/turn-right-and-move-forward actions, respectively, and yellow dots represent the no-op action. As the figure shows: 1) when the target resides in the right (left) side, the tracker tends to turn right (left), trying to move the camera to "pull" the target to the center; 2) when the target size becomes bigger, which probably indicates that the tracker is too close to the target, the tracker outputs no-op actions more often, intending to stop and wait for the target to move farther away.

Video Link to the experimental results can be found below: Video Demonstration of the Results

Supplementary Material for Further Experiments: Additional PDF and Video

Action Saliency Map: an input frame is fed into the tracker and forwarded through the network to output the policy function, from which an action is sampled. The gradient of this action is then propagated backwards to the input layer, generating a saliency map. The saliency map shows how the input image affects the tracker's action. Fig. 8 shows that the tracker has indeed learned to find the target before deciding how to act.
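A minimal PyTorch sketch of this saliency computation (assuming a model with the interface of the ConvNet-LSTM sketch above; not the authors' code):

<pre>
import torch

def action_saliency(model, frame, lstm_state=None):
    """Back-propagate the gradient of the sampled action's log-probability to the
    input frame and reduce over colour channels to get a per-pixel saliency map."""
    frame = frame.clone().requires_grad_(True)
    policy, value, _ = model(frame, lstm_state)    # policy: shape (1, K)
    action = torch.multinomial(policy, 1).item()   # sample a_t ~ pi(.|s_t)
    policy[0, action].log().backward()             # gradient of the chosen action
    return frame.grad.abs().max(dim=1)[0]          # shape (1, H, W)
</pre>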

=Conclusion=

In this paper, an end-to-end active tracker trained via deep reinforcement learning is proposed. Unlike conventional passive trackers, the proposed tracker is trained in simulators, saving the effort of human labeling and of expensive trial-and-error in the real world. It shows good generalization to unseen environments, and its tracking ability can potentially transfer to real-world scenarios.

=Critique=

The paper presents a solution for active tracking using reinforcement learning. A ConvNet-LSTM network is adopted, and environment augmentation is proposed for training it. The tracker trained with environment augmentation performs better than the one trained without it, in both the ViZDoom and UE environments. The reward function looks intuitive for the task at hand, which is object tracking.

The virtual environment ViZDoom, though used for training and testing, offers little evidence of generalization ability in real-world scenarios, as its maps are very simple. The comparisons presented for ViZDoom testing under changed environmental parameters look positive, but the relatively simple nature of the environment needs to be kept in mind when interpreting these results. Also, when the floor is replaced by the ceiling, the tracker performs the worst among the cases in the table, which suggests that the model is somewhat over-fitted to the floor and ceiling appearance.

The tracker trained in the UE environment is tested against simulated active trackers, and the results show that the proposed solution performs better. However, since these trackers are simulated using a camera-control algorithm written for this specific comparison, further testing is required for proper benchmarking. The real-world challenges of intensity variation, camera details, and control signals, though beyond the scope of the current paper, still need to be considered when discussing the generalization ability of the model to real-world scenarios. For example, the current action space includes only six discrete actions, which is inadequate for deployment in the real world because the tracker cannot adapt to different moving speeds of the target. It is also questionable whether training the tracker in the UE simulator alone is sufficient for a successful real-world deployment; it would be better to randomize more aspects of the environment during training, including the texture of each mesh, the illumination of the scene, and the trajectory and speed of the target. That said, the results on the real-world videos are a positive sign for the generalization ability of the model to real-world settings, and the overall approach presented in the paper is intuitive with promising results.

=Future Work=

The authors have since extended this work in several ways, culminating in a successful real-robot deployment and an enhanced system that addresses the virtual-to-real gap [1]. Specifically: 1) more advanced environment augmentation techniques are proposed to boost environment diversity, improving transferability tailored to the real world; 2) a more appropriate action space is developed compared with the conference paper, and the use of a continuous action space for active tracking is investigated; 3) a mapping from the neural network prediction to the robot control signal is established so as to deliver end-to-end tracking.

=References=

[1] Luo, W., Sun, P., Zhong, F., Liu, W., Zhang, T., and Wang, Y. End-to-end active object tracking and its real-world deployment via reinforcement learning.

[2] Ross, David A, Lim, Jongwoo, Lin, Ruei-Sung, and Yang, Ming- Hsuan. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1-3):125–141, 2008.

[3] Babenko, Boris, Yang, Ming-Hsuan, and Belongie, Serge. Visual tracking with online multiple instance learning. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 983–990, 2009.

[4] Bolme, David S, Beveridge, J Ross, Draper, Bruce A, and Lui, Yui Man. Visual object tracking using adaptive correlation filters. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 2544–2550, 2010.

[5] Hare, Sam, Golodetz, Stuart, Saffari, Amir, Vineet, Vibhav, Cheng, Ming-Ming, Hicks, Stephen L, and Torr, Philip HS. Struck: Structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2096–2109, 2016.

[6] Kalal, Zdenek, Mikolajczyk, Krystian, and Matas, Jiri. Tracking- learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409–1422, 2012.

[7] Wang, Naiyan and Yeung, Dit-Yan. Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems, pp. 809–817, 2013.

[8] Wang, Lijun, Ouyang, Wanli, Wang, Xiaogang, and Lu, Huchuan. Stct: Sequentially training convolutional networks for visual tracking. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 1373–1381, 2016.

[9] Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Detect to track and track to detect." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.