Zero-Shot Visual Imitation

Revision as of 20:49, 11 November 2018 by S362khan (talk | contribs) (Learning the Goal-Conditioned Skill Policy (GSP))

This page contains a summary of the paper "Zero-Shot Visual Imitation" by Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P. et al. It was published at the International Conference on Learning Representations (ICLR) in 2018.


The dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both what and how to imitate for a certain task. For example, in the robotics field, Learning from Demonstration (LfD) (Argall et al., 2009; Ng & Russell, 2000; Pomerleau, 1989; Schaal, 1999) requires an expert to manually move robot joints (kinesthetic teaching) or teleoperate the robot to teach the desired task. The expert will, in general, provide multiple demonstrations of a specific task at training time, which the agent turns into observation-action pairs and then distills into a policy for performing the task. For robots, this heavily supervised process is tedious and does not scale, because every new task needs a new set of demonstrations. In this paper, an alternative paradigm is pursued wherein an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss. Videos, models, and more details are available at [[1]].

Paper Overview

Observational Learning (Bandura & Walters, 1977), a term from the field of psychology, suggests a more general formulation where the expert communicates what needs to be done (as opposed to how something is to be done) by providing observations of the desired world states via video or sequential images. This is the setting the paper adopts. While it is a harder learning problem, it is arguably more useful because the expert can communicate a large number of tasks to the agent easily and quickly.

Figure 1: The goal-conditioned skill policy (GSP) takes as input the current and goal observations and outputs an action sequence that would lead to that goal. We compare the performance of the following GSP models: (a) Simple inverse model; (b) Multi-step GSP with previous action history; (c) Multi-step GSP with previous action history and a forward model as regularizer, but no forward consistency; (d) Multi-step GSP with forward consistency loss proposed in this work.

This paper follows (Agrawal et al., 2016; Levine et al., 2016; Pinto & Gupta, 2016) where an agent first explores the environment independently and then distills its observations into goal-directed skills. The word 'skill' is used to denote a function that predicts the sequence of actions to take the agent from the current observation to the goal. This function is what is known as a goal-conditioned skill policy (GSP) and it is learned by re-labeling states that the agent has visited as goals and the actions taken as prediction targets. During inference, the GSP recreates the task step-by-step given the goal observations from the demonstration.

A challenge of learning the GSP is that the distribution of trajectories from one state to another is multi-modal; that is, there are many possible ways of traversing from one state to another. This issue is addressed with the main contribution of this paper, the forward-consistent loss which essentially says that reaching the goal is more important than how it is reached. First, a forward model is learned that predicts the next observation from the given action and current observation. The difference in the output of the forward model for the GSP-selected action and the ground-truth next state is used to train the model. This forward-consistent loss has the effect of not inadvertently penalizing actions that are consistent with the ground-truth action but not exactly the same.

As a simple example to explain the forward-consistent loss, imagine a scenario where a robot must grab an object some distance ahead with an obstacle along the pathway. Now suppose that during demonstration the obstacle is avoided by going to the right and then grabbing the object while the agent during training decides to go left and then grab the object. The forward-consistent loss would characterize the action of the robot as consistent with the ground-truth action of the demonstrator and not penalize the robot for going left instead of right.

Of course, when introducing something like this forward-consistent loss, issues related to the number of steps needed to reach a certain goal become prevalent. To address this, the paper pairs the GSP with a goal recognizer that determines if the goal has been satisfied with respect to some metrics. Figure 1 shows various GSPs along with diagram d) showing the forward-consistent loss proposed in this paper.

The zero-shot imitator is tested on a Baxter robot performing tasks involving rope manipulation, a TurtleBot performing office navigation and navigation experiments in VizDoom. Positive results are shown for all three experiments leading to the conclusion that the forward-consistent GSP can be used to imitate a variety of tasks without making environmental or task-specific assumptions.

Related Work

Some key ideas related to this paper are imitation learning, visual demonstration, forward/inverse dynamics and consistency and finally, goal conditioning. The paper has more on each of these topics including citations to related papers. The propositions in this paper are related to imitation learning but the problem being addressed is different in that there is less supervision and the model requires generalization across tasks during inference.

Learning to Imitate Without Expert Supervision

In this section (and the included subsections) the methods for learning the GSP, forward consistency loss and goal recognizer network are described.

Let [math]S : \{x_1, a_1, x_2, a_2, ..., x_T\}[/math] be the sequence of observation-action pairs generated by the agent as it explores the environment. This exploration data is used to learn the GSP policy.

[math]\overrightarrow{a}_τ =π (x_i, x_g; θ_π)[/math]

The learned GSP policy ([math]π[/math]) takes as input a pair of observations [math](x_i, x_g)[/math] and outputs a sequence of actions [math](\overrightarrow{a}_τ : a_1, a_2, ..., a_K)[/math] to reach the goal observation [math]x_g[/math] starting from the current observation [math]x_i[/math]. The states (observations) [math]x_i[/math] and [math]x_g[/math] are sampled from [math]S[/math] and need not be consecutive. Given the start and stop states, the number of actions [math]K[/math] is also known. [math]π[/math] can be thought of as a deep network with parameters [math]θ_π[/math].
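The re-labeling of exploration data described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: observations are plain integers, actions are +1/-1 moves, and the function name `relabel_pairs` and its parameters `max_k` and `num_samples` are hypothetical, not taken from the paper.

```python
import random

def relabel_pairs(observations, actions, max_k, num_samples, seed=0):
    """Sample (x_i, x_g, action_sequence) training triples from one trajectory.

    observations: [x_1, ..., x_T]
    actions:      [a_1, ..., a_{T-1}], where actions[t] leads from
                  observations[t] to observations[t + 1].
    """
    assert len(actions) == len(observations) - 1
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        i = rng.randrange(len(observations) - 1)            # start index
        max_steps = min(max_k, len(observations) - 1 - i)   # stay inside the trajectory
        k = rng.randint(1, max_steps)                       # number of actions K
        # A visited state becomes the goal; the actions taken become the targets.
        samples.append((observations[i], observations[i + k], actions[i:i + k]))
    return samples

# Toy trajectory (assumption): integer observations, +1 / -1 actions.
obs = [0, 1, 2, 1, 2, 3]
acts = [1, 1, -1, 1, 1]
for x_i, x_g, a_seq in relabel_pairs(obs, acts, max_k=3, num_samples=3):
    print(x_i, x_g, a_seq)
```

Because the start and goal are both drawn from the same trajectory, the number of actions between them is known, which matches the remark about [math]K[/math] above.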

At test time, the expert demonstrates a task from which the agent captures a sequence of observations. This set of images is denoted by [math]D: \{x_1^d, x_2^d, ..., x_N^d\}[/math]. The sequence needs to have at least one entry and can be as temporally dense as needed (i.e. the expert can show as many goals or sub-goals as needed to the agent). The agent then uses its learned policy to start from initial state [math]x_0[/math] and generate actions predicted by [math]π(x_0, x_1^d; θ_π)[/math] to follow the observations in [math]D[/math].

The agent does not have access to the sequence of actions performed by the expert. Hence, it must use the observations to determine if it has reached the goal. A separate goal recognizer network is needed to ascertain whether the current observation is close to the current goal, because multiple actions might be required to get close to [math]x_1^d[/math]. Let [math]x_0^\prime[/math] be the observation after executing the predicted action. The goal recognizer evaluates whether [math]x_0^\prime[/math] is sufficiently close to the goal and, if not, the agent executes [math]a = π(x_0^\prime, x_1^d; θ_π)[/math]. Once it is sufficiently close to [math]x_1^d[/math], the agent sets [math]x_2^d[/math] as the goal and continues executing actions. This process is repeated for each image in [math]D[/math] until the final goal is reached.
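The test-time procedure above is essentially a nested loop: an outer loop over the demonstrated observations and an inner loop that keeps acting until the goal recognizer fires. The sketch below illustrates that control flow in a toy 1D world. The policy, environment step and goal recognizer are hand-written stand-ins (assumptions), and the `max_steps_per_goal` budget is an added safeguard for the sketch rather than something taken from this summary.

```python
def follow_demonstration(x0, demo, policy, step, is_close, max_steps_per_goal=20):
    """Visit each demonstrated observation in order; return all visited states."""
    x = x0
    visited = [x]
    for goal in demo:                      # x_1^d, x_2^d, ..., x_N^d
        steps = 0
        while not is_close(x, goal) and steps < max_steps_per_goal:
            a = policy(x, goal)            # a = pi(x, goal)
            x = step(x, a)                 # execute the action, observe x'
            visited.append(x)
            steps += 1
    return visited

# Toy stand-ins (assumptions): integer positions on a line.
def toy_policy(x, goal):
    return 1 if goal > x else -1

def toy_step(x, a):
    return x + a

def toy_is_close(x, goal):
    return x == goal

print(follow_demonstration(0, [3, 1], toy_policy, toy_step, toy_is_close))
# [0, 1, 2, 3, 2, 1]
```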

Learning the Goal-Conditioned Skill Policy (GSP)

In this section, the one-step version of the GSP is described first. It is then extended to the multi-step version.

A one-step trajectory can be described as [math](x_t, a_t, x_{t+1})[/math]. Given [math](x_t, x_{t+1})[/math] the GSP policy estimates an action, [math]\hat{a}_t = π(x_t, x_{t+1}; θ_π)[/math]. During training, cross-entropy loss is used to learn GSP parameters [math]θ_π[/math]:

[math]L(a_t, \hat{a}_t) = p(a_t|x_t, x_{t+1}) \log(\hat{a}_t)[/math]

[math]a_t[/math] and [math]\hat{a}_t[/math] are the ground-truth and predicted actions respectively. The conditional distribution [math]p[/math] is not readily available, so it needs to be empirically approximated using the data. In a standard deep learning problem it is common to assume [math]p[/math] is a delta function at [math]a_t[/math]; given a specific input, the network produces a single output. However, in this problem multiple actions can lead to the same next observation. Multiple outputs given a single input can be modeled using a variational auto-encoder. However, the authors use a different approach, explained in sections 2.2-2.4 of the paper and summarized in the following sections.
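To see why the delta-function assumption is a problem, consider a small numeric illustration. It uses the usual cross-entropy convention with a leading minus sign, so lower is better. The two-action setup and all probabilities below are made-up values for illustration only.

```python
import math

def cross_entropy(target_probs, predicted_probs):
    """Standard cross-entropy: -sum_a p(a) * log q(a)."""
    return -sum(p * math.log(q)
                for p, q in zip(target_probs, predicted_probs) if p > 0)

# Two actions, [left, right]; assume both lead to the same next observation.
predicted = [0.9, 0.1]           # the policy prefers "left"
delta_target = [0.0, 1.0]        # the logged action happened to be "right"
multimodal_target = [0.5, 0.5]   # hypothetical true conditional p

print(cross_entropy(delta_target, predicted))       # about 2.30
print(cross_entropy(multimodal_target, predicted))  # about 1.20
```

With the delta target, the policy is penalized heavily for preferring an action that would have worked equally well. This is the multi-modality issue that motivates the approach in the next section.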

Forward Consistency Loss

To deal with multi-modality, this paper proposes the forward consistency loss where instead of penalizing actions predicted by the GSP to match the ground truth, the parameters of the GSP are learned such that they minimize the distance between observation [math]\hat{x}_{t+1}[/math] (prediction from executing [math]\hat{a}_t = π(x_t, x_{t+1}; θ_π)[/math] ) and the observation [math]x_{t+1}[/math] (ground truth). This is done so that the predicted action is not penalized if it leads to the same next state as the ground-truth action. This will in turn reduce the variation in gradients and aid the learning process. This is what is denoted as forward consistency loss.

To operationalize the forward consistency loss, we need a differentiable "forward dynamics" model that can reliably predict the results of an action. The forward dynamics model [math]f[/math] is learned from the data and is defined as [math]\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)[/math]. Since [math]f[/math] is not analytic, there is no guarantee that [math]\widetilde{x}_{t+1} = \hat{x}_{t+1} [/math], so an additional term is added to the loss: [math]||x_{t+1} - \hat{x}_{t+1}||_2^2 [/math]. The parameters [math]θ_f[/math] are inferred by minimizing [math]||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 [/math] where λ is a scalar hyper-parameter. The first term ensures that the learned model explains the ground-truth transitions while the second term ensures consistency. In summary, the loss function is given below:

[math]\underset{θ_π, θ_f}{\min} \bigg( ||x_{t+1} - \widetilde{x}_{t+1}||_2^2 + λ||x_{t+1} - \hat{x}_{t+1}||_2^2 + L(a_t, \hat{a}_t) \bigg)[/math], such that
[math]\widetilde{x}_{t+1} = f(x_t, a_t; θ_f)[/math]
[math]\hat{x}_{t+1} = f(x_t, \hat{a}_t; θ_f)[/math]
[math]\hat{a}_t = π(x_t, x_{t+1}; θ_π)[/math]
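The three terms of this objective can be illustrated with a toy example. The sketch below is not the paper's implementation: the forward model is a fixed hand-written function rather than a learned network, states are scalars, the action names are invented, and a 0/1 mismatch stands in for the cross-entropy term [math]L(a_t, \hat{a}_t)[/math].

```python
# Hypothetical toy dynamics: two different actions produce the same displacement.
EFFECT = {"pull_left_end": 1.0, "push_right_end": 1.0, "stay": 0.0}

def forward_model(x, action):
    """Stand-in for f(x, a; theta_f); here it is exact instead of learned."""
    return x + EFFECT[action]

def loss_terms(x_t, a_t, x_next, a_hat, lam=1.0):
    x_tilde = forward_model(x_t, a_t)      # prediction for the ground-truth action
    x_hat = forward_model(x_t, a_hat)      # prediction for the GSP-selected action
    model_term = (x_next - x_tilde) ** 2
    consistency_term = lam * (x_next - x_hat) ** 2
    action_mismatch = 0.0 if a_hat == a_t else 1.0   # 0/1 stand-in for L(a_t, a_hat)
    return model_term, consistency_term, action_mismatch

# Logged transition: x_t = 0.0, a_t = "pull_left_end", x_{t+1} = 1.0.
print(loss_terms(0.0, "pull_left_end", 1.0, "push_right_end"))  # (0.0, 0.0, 1.0)
print(loss_terms(0.0, "pull_left_end", 1.0, "stay"))            # (0.0, 1.0, 1.0)
```

In the first call the predicted action differs from the logged one but reaches the same next state, so the consistency term is zero. In the second call the predicted action leads somewhere else and the consistency term is non-zero, which is the distinction a pure action-matching loss cannot make.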

Past works have shown that learning forward dynamics in the feature space as opposed to raw observation space is more robust, so the paper follows suit and extends the GSP to make predictions on feature representations denoted [math]\phi(x_t), \phi(x_{t+1})[/math]. The forward consistency loss is then computed in the feature space [math]\phi[/math]. Learning [math]θ_π,θ_f[/math] from scratch can cause noisier gradient updates for [math]π[/math]. This is addressed by pre-training the forward model with the first term and the GSP separately by blocking gradient flow. Fine-tuning is then done with [math]θ_π,θ_f[/math] jointly.

The generalization to multi-step GSP [math]π_m[/math] is shown below where [math]\phi[/math] refers to the feature space rather than observation space which was used in the single-step case:

[math]\underset{θ_π, θ_f, θ_{\phi}}{\min} \sum_{t=i}^{t=T} \bigg(||\phi(x_{t+1}) - \phi(\widetilde{x}_{t+1})||_2^2 + λ||\phi(x_{t+1}) - \phi(\hat{x}_{t+1})||_2^2 + L(a_t, \hat{a}_t)\bigg)[/math], such that
[math]\phi(\widetilde{x}_{t+1}) = f\big(\phi(x_t), a_t; θ_f\big)[/math]
[math]\phi(\hat{x}_{t+1}) = f\big(\phi(x_t), \hat{a}_t; θ_f\big)[/math]
[math]\hat{a}_t = π\big(\phi(x_t), \phi(x_{t+1}); θ_π\big)[/math]

The forward consistency loss is computed at each time step [math]t[/math] and jointly optimized with the action prediction loss over the whole trajectory. [math]\phi(.)[/math] is represented by a CNN with parameters [math]θ_{\phi}[/math]. The multi-step forward-consistent GSP [math] \pi_m[/math] is implemented via a recurrent network whose inputs are the current state, the goal states, the actions at the previous time step and the internal hidden representation [math] h_{t-1}[/math], and whose output is the actions to take.

Goal Recognizer

The goal recognizer network was introduced to figure out if the current goal is reached. This allows the agent to take multiple steps between goals without being penalized. In this paper, goal recognition was taken as a binary classification problem; given an observation and the goal, is the observation close to the goal or not.

The goal recognizer was trained on data from the agent's random exploration. Pseudo-goal states were sampled from the visited states, and all observations within a few timesteps of these were considered as positive results (close to the goal). The goal classifier was trained using the standard cross-entropy loss.
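A minimal sketch of how such training examples could be built from one exploration trajectory is shown below. The function name, the `window` parameter and the choice to label every other observation in the trajectory as a negative are assumptions made for illustration. This summary only states how positives are chosen.

```python
import random

def make_goal_recognizer_examples(trajectory, window, num_goals, seed=0):
    """Return (observation, goal, label) triples; label 1 means 'close to goal'."""
    rng = random.Random(seed)
    examples = []
    for _ in range(num_goals):
        g = rng.randrange(len(trajectory))        # index of the pseudo-goal
        for t, obs in enumerate(trajectory):
            # Positive if within `window` timesteps of the pseudo-goal.
            # Treating everything else as negative is an assumption.
            label = 1 if abs(t - g) <= window else 0
            examples.append((obs, trajectory[g], label))
    return examples

# Toy trajectory (assumption): the observation is just its own timestep.
examples = make_goal_recognizer_examples(list(range(10)), window=2, num_goals=2)
print(sum(label for _, _, label in examples), "positives out of", len(examples))
```

The resulting triples are what a binary classifier trained with cross-entropy loss would consume.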

The authors found that training a separate goal recognition network outperformed simply adding a 'stop' action to the action space of the policy network.

Ablations and Baselines

To summarize, the GSP formulation is composed of (a) recurrent variable-length skill policy network, (b) explicitly encoding the previous action in the recurrence, (c) goal recognizer, (d) forward consistency loss function, and (e) learning forward dynamics in the feature space instead of raw observation space.

To show the importance of each component a systematic ablation (removal) of components for each experiment is done to show the impact on visual imitation. The following methods will be evaluated in the experiments section:

  1. Classical methods: In visual navigation, the paper compares against the state-of-the-art ORB-SLAM2 and Open-SFM.
  2. Inverse model: Nair et al. (2017) leverage vanilla inverse dynamics to follow demonstration in rope manipulation setup.
  3. GSP-NoPrevAction-NoFwdConst refers to the paper's recurrent GSP without previous action history and without the forward consistency loss.
  4. GSP-NoFwdConst refers to the recurrent GSP with previous action history, but without the forward consistency objective.
  5. GSP-FwdRegularizer refers to the model where forward prediction is only used to regularize the features of GSP but has no role to play in the loss function of predicted actions.
  6. GSP refers to the complete method with all the components.


The model is evaluated by testing performance on a rope manipulation task using a Baxter Robot, navigation of a TurtleBot in cluttered office environments and simulated 3D navigation in VizDoom. A good skill policy will generalize to unseen environments and new goals while staying robust to irrelevant distractors and observations. For the rope manipulation task this is tested by making the robot tie a knot, a task it did not observe during training. For the navigation tasks, generalization is checked by getting the agents to traverse new buildings and floors.

Rope Manipulation

Rope manipulation is an interesting task because even humans learn complex rope manipulation, such as tying knots, via observing an expert perform it.

In this paper, rope manipulation data collected by Nair et al. (2017) is used, where a Baxter robot manipulated a rope kept on a table in front of it. During this exploration, the robot picked up the rope at a random point and displaced it randomly on the table. 60K interaction pairs were collected of the form [math](x_t, a_t, x_{t+1})[/math]. These were used to train the GSP proposed in this paper.

For this experiment, the Baxter robot is set up exactly like the one presented in Nair et al. (2017). The robot is tasked with manipulating the rope into an 'S' as well as tying a knot as shown in Figure 2. In testing, the robot was only provided with images of intermediate states of the rope, and not the actions taken by the human trainer. The thin plate spline robust point matching technique (TPS-RPM) (Chui & Rangarajan, 2003) is used to measure the performance of constructing the 'S' shape as shown in Figure 3. Visual verification (by a human) was used to assess the tying of a successful knot.

The base architecture consisted of a pre-trained AlexNet whose features were fed into a skill policy network that predicts the location of grasp, the direction of displacement and the magnitude of displacement. All models were optimized using Adam with a learning rate of 1e-4. For the first 40K iterations, the AlexNet weights were frozen and then fine-tuned jointly with the later layers. More details are provided in the appendix of the paper.

The approach of this paper is compared to (Nair et al., 2017), who ran similar experiments using an inverse model. The results in Figure 3 show that zero-shot visual imitation achieves a lower TPS-RPM error on the 'S' shape construction, and that for knot-tying it achieves a success rate of 60% versus the 36% baseline from the inverse model.

Figure 2: Qualitative visualization of results for rope manipulation task using Baxter robot. (a) The robotics system setup. (b) The sequence of human demonstration images provided by the human during inference for the task of knot-tying (top row), and the sequences of observation states reached by the robot while imitating the given demonstration (bottom rows). (c) The sequence of human demonstration images and the ones reached by the robot for the task of manipulating rope into ‘S’ shape. Our agent is able to successfully imitate the demonstration.
Figure 3: GSP trained using forward consistency loss significantly outperforms the baselines at the task of (a) manipulating rope into 'S' shape as measured by TPS-RPM error and (b) knot-tying where a success rate is reported with bootstrap standard deviation
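The bootstrap standard deviation mentioned in the Figure 3 caption can be computed as in the sketch below. The trial outcomes are made-up illustrative values, not the paper's data, and the number of resamples is an arbitrary choice.

```python
import random
import statistics

def bootstrap_std(outcomes, num_resamples=1000, seed=0):
    """Standard deviation of the success rate over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(outcomes)
    rates = []
    for _ in range(num_resamples):
        # Resample n trials with replacement and record the success rate.
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        rates.append(sum(resample) / n)
    return statistics.pstdev(rates)

# Hypothetical outcomes of 10 trials (1 = success, 0 = failure).
trials = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(sum(trials) / len(trials), bootstrap_std(trials))
```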

Navigation in Indoor Office Environments

In this experiment, the robot was shown a single image or multiple images to lead it to the goal. The robot, a TurtleBot2, autonomously moves to the goal. For learning the GSP, an automated self-supervised method for data collection was devised that didn't require human supervision. The robot explored two floors of an academic building and collected 230K interactions [math](x_t, a_t, x_{t+1})[/math] (more detail is provided in the appendix of the paper). The robot was then placed on an unseen floor of the building with different textures and furniture layout for performing visual imitation at test time.

The collected data was used to train a recurrent forward-consistent GSP. The base architecture for the model was an ImageNet pre-trained ResNet-50 network. The loss weight of the forward model is 0.1 and the objective is minimized using Adam with a learning rate of 5e-4. More details on the implementation are given in the appendix of the paper.

Figure 4 shows the robot's observations during testing. Table 1 shows the results of this experiment; as can be seen, GSP fares much better than all previous baselines.

Figure 4: Visualization of the TurtleBot trajectory to reach a goal image (right) from the initial image (top-left). Since the initial and goal image has no overlap, the robot first explores the environment by turning in place. Once it detects overlap between its current image and goal image (i.e. step 42 onward), it moves towards the goal. Note that we did not explicitly train the robot to explore and such exploratory behavior naturally emerged from the self-supervised learning.
Table 1: Quantitative evaluation of various methods on the task of navigating using a single image of goal in an unseen environment. Each column represents a different run of our system for a different initial/goal image pair. The full GSP model takes longer to reach the goal on average given a successful run but reaches the goal successfully at a much higher rate.

Figure 5 and Table 2 show the results for the robot performing a task with multiple waypoints, i.e. the robot was shown multiple sub-goals instead of just one final goal state. This was required when the end goal was far away from the robot, such as in another room. Notably, zero-shot visual imitation is robust to a changing environment in which not every frame matches the demonstrated frame. This is achieved by providing sparse landmarks.

Figure 5: The performance of TurtleBot at following a visual demonstration given as a sequence of images (top row). The TurtleBot is positioned in a manner such that the first image in the demonstration has no overlap with its current observation. Even under this condition, the robot is able to move closer to the first demo image (shown as Robot WayPoint-1) and then follow the provided demonstration until the end. This also exemplifies a failure case for classical methods; there are no possible keypoint matches between WayPoint-1 and WayPoint-2, and the initial observation is even farther from WayPoint-1.
Table 2: Quantitative evaluation of TurtleBot’s performance at following visual demonstrations in two scenarios: maze and the loop. We report the % of landmarks reached by the agent across three runs of two different demonstrations. Results show that our method outperforms the baselines. Note that 3 more trials of the loop demonstration were tested under significantly different lighting conditions and neither model succeeded. Detailed results are available in the supplementary materials.

3D Navigation in VizDoom

To round off the experiments, a VizDoom simulation environment was used to test the GSP. VizDoom is a popular Doom-based reinforcement learning testbed. It allows agents to play the game Doom using only a screen buffer. It is a 3D simulation environment that is traditionally considered to be harder than 2D domains like Atari. The goal was to measure the robustness of each method with proper error bars, the role of initial self-supervised data collection and the quantitative difference in modeling forward consistency loss in feature space in comparison to raw visual space.

Data were collected using two methods: random exploration and curiosity-driven exploration (Pathak et al., 2017). The hypothesis here is that better data rather than just random exploration can lead to a better learned GSP. More details on the implementation are given in the paper appendix.

Table 3 shows the results of the VizDoom experiments with the key takeaway that the data collected via curiosity seems to improve the final imitation performance across all methods.

Table 3: Quantitative evaluation of our proposed GSP and the baseline models at following visual demonstrations in VizDoom 3D Navigation. Medians and 95% confidence intervals are reported for demonstration completion and efficiency over 50 seeds and 5 human paths per environment type.


This work presented a method for imitating expert demonstrations from visual observations alone. The key idea is to learn a GSP utilizing data collected by self-supervision. A limitation of this approach is that the quality of the learned GSP is restricted by the exploration data. For instance, moving to a goal in between rooms would not be possible without an intermediate sub-goal. So, future research in zero-shot imitation could aim to generalize the exploration such that the agent is able to explore across different rooms for example.

Another limitation of the work in this paper is that the method requires first-person view demonstrations. Extending it to third-person demonstrations may yield a more general framework. Also, in the current framework, it is assumed that the visual observations of the expert and agent are similar. When the expert performs a demonstration in one setting such as daylight, and the agent performs the task in the evening, results may worsen.

The expert demonstrations are also purely imitated; that is, the agent does not learn from the demonstrations. Future work could look into learning from the demonstrations so as to enrich the agent's exploration techniques.

This work used a sequence of images to provide a demonstration but the work, in general, does not make image-specific assumptions. Thus the work could be extended to using formal language to communicate goals, an idea left for future work. Future work would also explore how multiple tasks can be combined into a single model, where different tasks might come from different contexts. Finally, it would be exciting to explore explicit handling of domain shift in future work, so as to handle large differences in embodiment and learn skills directly from videos of human demonstrators obtained, for example, from the Internet.


[1] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell. Zero-shot visual imitation. In ICLR, 2018.

[2] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 2009.

[3] Albert Bandura and Richard H Walters. Social learning theory, volume 1. Prentice-hall Englewood Cliffs, NJ, 1977.

[4] Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. NIPS, 2016.

[5] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with large-scale data collection. In ISER, 2016.

[6] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. ICRA, 2016.

[7] Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Combining self-supervised learning and imitation for vision-based rope manipulation. ICRA, 2017.

[8] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.