Difference between revisions of "Robust Imitation of Diverse Behaviors"

From statwiki
Jump to: navigation, search
(Added Conclusion)
(Added References)
Line 116: Line 116:
  
 
=References=
 
=References=
 +
# Wang, Z., Merel, J. S., Reed, S. E., de Freitas, N., Wayne, G., & Heess, N. (2017). Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems (pp. 5326-5335).
 +
# https://deepmind.com/blog/producing-flexible-behaviours-simulated-environments/
 +
# https://www.youtube.com/watch?v=NaohsyUxpxw
 +
# https://www.youtube.com/watch?v=VBrIll0B24o

Revision as of 14:00, 7 March 2018

Introduction

One of the longest standing challenges in AI is building versatile embodied agents, both in the form of real robots and animated avatars, capable of a wide and diverse set of behaviors. State-of-the-art robots cannot compete with the effortless variety and adaptive flexibility of motor behaviors produced by toddlers. Towards addressing this challenge, the authors combine several deep generative approaches to imitation learning in a way that accentuates their individual strengths and addresses their limitations. The end product is a robust neural network policy that can imitate a large and diverse set of behaviors using few training demonstrations.

Motivation

Some of the models that have recently shown great promise in imitation learning for motor control are the deep generative models. The authors primarily talk about two approaches viz. supervised approaches that condition on demonstrations and Generative Adversarial Imitation Learning (GAIL) and their limitations and try to combine those two approaches in order to address these limitations. Some of these limitations are as follows:

  • Supervised approaches that condition on demonstrations (VAE):
    • They require large training datasets in order to work for non-trivial tasks
    • They tend to be brittle and fail when the agent diverges too much from the demonstration trajectories
  • Generative Adversarial Imitation Learning (GAIL)
    • Adversarial training leads to mode-collapse (the tendency of adversarial generative models to cover only a subset of modes of a probability distribution, resulting in a failure to produce adequately diverse samples)
    • More difficult and slow to train as they do not immediately provide a latent representation of the data

Thus, the former approach can model diverse behaviors without dropping modes but does not learn robust policies, while the latter approach gives robust policies but insufficiently diverse behaviors. Thus, the authors combine the favorable aspects of these two approaches. The base of their model is a new type of variational autoencoder on demonstration trajectories that learns semantic policy embeddings. Leveraging these policy representations, they develop a new version of GAIL that

  1. is much more robust than the purely-supervised controller, especially with few demonstrations, and
  2. avoids mode collapse, capturing many diverse behaviors when GAIL on its own does not.

Model

The authors first introduce a variational autoencoder (VAE) for supervised imitation, consisting of a bi-directional LSTM encoder mapping demonstration sequences to embedding vectors, and two decoders. The first decoder is a multi-layer perceptron (MLP) policy mapping a trajectory embedding and the current state to a continuous action vector. The second is a dynamics model mapping the embedding and previous state to the present state while modeling correlations among states with a WaveNet.

Model Architecture.png

Behavioral cloning with VAE suited for control

In this section, the authors follow a similar approach to Duan et al., but opt for stochastic VAEs as having a distribution [math]q_\phi(z|x_{1:T})[/math] to better regularize the latent space. In their VAE, an encoder maps a demonstration sequence to an embedding vector [math]z[/math]. Given [math]z[/math], they decode both the state and action trajectories as shown in the figure above. To train the model, the following loss is minimized:

\begin{align} L\left( \alpha, w, \phi; \tau_i \right) = - \pmb{\mathbb{E}}_{q_{\phi}(z|x_{1:T_i}^i)} \left[ \sum_{t=1}^{T_i} log \pi_\alpha \left( a_t^i|x_t^i, z \right) + log p_w \left( x_{t+1}^i|x_t^i, z\right) \right] +D_{KL}\left( q_\phi(z|x_{1:T_i}^i)||p(z) \right) \end{align}

The encoder [math]q[/math] uses a bi-directional LSTM. To produce the final embedding, it calculates the average of all the outputs of the second layer of this LSTM before applying a final linear transformation to generate the mean and standard deviation of a Gaussian. Then, one sample from this Gaussian is taken as the demonstration encoding.

The action decoder is an MLP that maps the concatenation of the state and the embedding of the parameters of a Gaussian policy. The state decoder is similar to a conditional WaveNet model. In particular, it conditions on the embedding [math]z[/math] and previous state [math]x_{t-1}[/math] to generate the vector [math]x_t[/math] autoregressively. That is, the autoregression is over the components of the vector [math]x_t[/math]. Finally, instead of a Softmax, the model uses a mixture of Gaussians as the output of the WaveNet.

Diverse generative adversarial imitiation learning

To enable GAIL to produce diverse solutions, the authors condition the discriminator on the embeddings generated by the VAE encoder and integrate out the GAIL objective with respect to the variational posterior [math]q_\phi(z|x_{1:T})[/math]. Specifically, the authors train the discriminator by optimizing the following objective:

\begin{align} {max}_{\psi} \pmb{\mathbb{E}}_{\tau_i \sim \pi_E} \left( \pmb{\mathbb{E}}_{q(z|x_{1:T_i}^i)} \left[\frac{1}{T_i} \sum_{t=1}^{T_i} logD_{\psi} \left( x_t^i, a_t^i | z \right) + \pmb{\mathbb{E}}_{\pi_\theta} \left[ log(1 - D_\psi(x, a | z)) \right] \right] \right) \end{align}

The authors condition on unlabeled trajectories, which have been passed through a powerful encoder, and hence this approach is capable of one-shot imitation learning. Moreover, the VAE encoder enables to obtain a continuous latent embedding space where interpolation is possible.

To better motivate the objective, the authors propose on temporarily leaving the context of imitation learning and considering an alternative objective for training GANs

\begin{align} {min}_{G}{max}_{D} V (G, D) = \int_{y} p(y) \int_{z} q(z|y) \left[ log D(y | z) + \int_{\hat{y}} G(\hat{y} | z) log (1 - D(\hat{y} | z)) d\hat{y} \right] dy dz \end{align}

This function is a simplification of the previous objective function. Furthermore, it satisfies the following property.

Lemma 1

Assuming that [math]q[/math] computes the true posterior distribution that is [math]q(z|y) = \frac{p(y|z)p(z)}{p(y)}[/math] then

\begin{align} V (G, D) = \int_{z} p(z) \left[ \int_{y} p(y|z) log D(y|z) dy + \int_{\hat{y}} G(\hat{y} | z) log (1 - D(\hat{y} | z)) d\hat{y} \right] dz \end{align}

If an optimal discriminator is further assumed, the cost-optimized by the generator then becomes

\begin{align} C(G) = 2 \int_ p p(z) JSD[p(\cdot|z) || G(\cdot|z)] dz - log4 \end{align}

where [math]JSD[/math] stands for the Jensen-Shannon divergence.

Experiments

The primary focus of the paper's experimental evaluation is to demonstrate that the architecture allows learning of robust controllers capable of producing the full spectrum of demonstration behaviors for a diverse range of challenging control problems. The authors consider three bodies: a 9 DoF robotic arm, a 9 DoF planar walker, and a 62 DoF complex humanoid (56-actuated joint angles, and a freely translating and rotating 3d root joint). While for the reaching task BC is sufficient to obtain a working controller, for the other two problems the full learning procedure is critical.

The authors analyze the resulting embedding spaces and demonstrate that they exhibit a rich and sensible structure that can be exploited for control. Finally, the authors show that the encoder can be used to capture the gist of novel demonstration trajectories which can then be reproduced by the controller.

Robotic arm reaching

In this experiment, the authors demonstrate the effectiveness of their VAE architecture and investigate the nature of the learned embedding space on a reaching task with a simulated Jaco arm.

To obtain demonstrations, the authors trained 60 independent policies to reach to random target locations in the workspace starting from the same initial configuration. 30 trajectories from each of the first 50 policies were generated. These served as training data for the VAE model (1500 training trajectories in total). The remaining 10 policies were used to generate test data.

The reaching task is relatively simple, so with this amount of data the VAE policy is fairly robust. After training, the VAE encodes and reproduces the demonstrations as shown in the figure below.

Robotic arm reaching.png

2D Walker

As a more challenging test compared to the reaching task, the authors consider bipedal locomotion. Here, the authors train 60 neural network policies for a 2d walker to serve as demonstrations. These policies are each trained to move at different speeds both forward and backward depending on a label provided as additional input to the policy. Target speeds for training were chosen from a set of four different speeds (m/s): -1, 0, 1, 3.

2D Walker.png

The authors trained their model with 20 episodes per policy (1200 demonstration trajectories in total, each with a length of 400 steps or 10s of simulated time). In this experiment a full approach is required: training the VAE with BC alone can imitate some of the trajectories, but it performs poorly in general, presumably because the relatively small training set does not cover the space of trajectories sufficiently densely. On this generated dataset, they also train policies with GAIL using the same architecture and hyper-parameters. Due to the lack of conditioning, GAIL does not reproduce coherently trajectories. Instead, it simply meshes different behaviors together. In addition, the policies trained with GAIL also, exhibit dramatically less diversity; see video.

In the left panel, the planar walker demonstrates a particular walking style. In the right panel, the model's agent imitates this walking style using a single policy network.

Complex humanoid

For this experiment, the authors consider a humanoid body of high dimensionality that poses a hard control problem. They generate training trajectories with the existing controllers, which can produce instances of one of six different movement styles. Examples of such trajectories are shown in Fig. 5.

Complex humanoid.png

The training set consists of 250 random trajectories from 6 different neural network controllers that were trained to match 6 different movement styles from the CMU motion capture database. Each trajectory is 334 steps or 10s long. The authors use a second set of 5 controllers from which they generate trajectories for evaluation (3 of these policies were trained on the same movement styles as the policies used for generating training data).

Surprisingly, despite the complexity of the body, supervised learning is quite effective at producing sensible controllers: The VAE policy is reasonably good at imitating the demonstration trajectories, although it lacks the robustness to be practically useful. Adversarial training dramatically improves the stability of the controller. The authors analyze the improvement quantitatively by computing the percentage of the humanoid falling down before the end of an episode while imitating either training or test policies. The results are summarized in Figure 5 right. The figure further shows sequences of frames of representative demonstration and associated imitation trajectories. Videos of demonstration and imitation behaviors can be found in the video.

In the left and middle panels we show two demonstrated behaviors. In the right panel, the model's agent produces an unseen transition between those behaviors.

Also, for practical purposes, it is desirable to allow the controller to transition from one behavior to another. The authors test this possibility in an experiment similar to the one for the Jaco arm: They determine the embedding vectors of pairs of demonstration trajectories, start the trajectory by conditioning on the first embedding vector, and then transition from one behavior to the other half-way through the episode by linearly interpolating the embeddings of the two demonstration trajectories over a window of 20 control steps. Although not always successful the learned controller often transitions robustly, despite not having been trained to do so. An example of this can be seen in the gif above. More examples can be seen in this video.

Conclusions

The authors have proposed an approach for imitation learning that combines the favorable properties of techniques for density modeling with latent variables (VAEs) with those of GAIL. The result is a model that learns, from a moderate number of demonstration trajectories

  1. a semantically well structured embedding of behaviors,
  2. a corresponding multi-task controller that allows to robustly execute diverse behaviors from this embedding space, as well as
  3. an encoder that can map new trajectories into the embedding space and hence allows for one-shot imitation.

The experimental results demonstrate that this approach can work on a variety of control problems and that it scales even to very challenging ones such as the control of a simulated humanoid with a large number of degrees of freedoms.

Critique

References

  1. Wang, Z., Merel, J. S., Reed, S. E., de Freitas, N., Wayne, G., & Heess, N. (2017). Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems (pp. 5326-5335).
  2. https://deepmind.com/blog/producing-flexible-behaviours-simulated-environments/
  3. https://www.youtube.com/watch?v=NaohsyUxpxw
  4. https://www.youtube.com/watch?v=VBrIll0B24o