Visual Reinforcement Learning with Imagined Goals: Difference between revisions

From statwiki
(Technical)
 
(97 intermediate revisions by 27 users not shown)
Line 1: Line 1:
[Need add more pics and references]
Video and details of this work are available [https://sites.google.com/site/visualrlwithimaginedgoals/ here]
 
=Introduction and Motivation=


Humans are able to accomplish many tasks without any explicit or supervised training, simply by exploring their environment. If I were dropped in the middle of Moscow, then after walking around in an undirected manner I could accomplish a specific task (e.g., going to the grocery store) that I had never been asked to do before, simply by knowing from past experience where the store is located. We are able to set our own goals and learn from our experiences, and are thus able to accomplish specific tasks without ever having been trained explicitly for them, a core principle of generalization.
Humans are able to accomplish many tasks without any explicit or supervised training, simply by exploring their environment. We are able to set our own goals and learn from our experiences, and thus are able to accomplish specific tasks without ever having been trained explicitly for them. It would be ideal if an autonomous agent could also set its own goals and learn from its environment.


Naturally, the next question for any machine learning scientist is: can an autonomous agent also set its own goals and learn from its environment? In the paper “Visual Reinforcement Learning with Imagined Goals”, the authors devise such an unsupervised reinforcement learning system. Specifically, they introduce a system that sets abstract goals and autonomously learns to achieve those goals. They then show that they can use these autonomously learned skills to perform a variety of user-specified goals, such as pushing objects, grasping objects, and opening doors, without any additional learning. Lastly, they demonstrate that their method is efficient enough to work in the real world on a Sawyer robot. The robot learns to set and achieve goals involving pushing an object to a specific location, with only images as the input to the system.
In the paper “Visual Reinforcement Learning with Imagined Goals”, the authors are able to devise such an unsupervised reinforcement learning system. They introduce a system that sets abstract (self-generated) goals and autonomously learns to achieve those goals. They then show that the system can use these autonomously learned skills to perform a variety of user-specified goals, such as pushing objects, grasping objects, and opening doors, without any additional learning. Lastly, they demonstrate that their method is efficient enough to work in the real world on a Sawyer robot. The robot learns to set and achieve goals with only images as the input to the system.
 
The algorithm proposed by the authors is summarized below. A variational autoencoder (VAE) (left) learns a latent representation of images gathered during training time. These latent variables are used to train a policy on imagined goals (center), which can then be used for accomplishing user-specified goals (right).
 
[[File: WF_Sec_11Nov25_01.png |center| 800px]]


=Related Work =
Many previous works on vision-based deep reinforcement learning for robotics studied a variety of behaviors such as grasping [1], pushing [2], navigation [3], and other manipulation tasks [4]. However, their assumptions on the models limit their suitability for training general-purpose robots. Some previous works such as Levine et al. [11] proposed time-varying models which require episodic setups and thus are hard to generalize to non-episodic and continuous learning scenarios. There are also other works such as Pinto et al. [12] that proposed an approach using goal images, but it requires instrumented training simulations. Lillicrap et al. [13] use fully model-free training, but do not learn goal-conditioned skills. (Model-based RL uses experience to construct an internal model of the transitions and immediate outcomes in the environment. Appropriate actions are then chosen by searching or planning in this world model. Model-free RL, on the other hand, uses experience to learn directly one or both of two simpler quantities (state/action values or policies) which can achieve the same optimal behavior but without estimation or use of a world model. Given a policy, a state has a value, defined in terms of the future utility that is expected to accrue starting from that state [https://www.princeton.edu/~yael/Publications/DayanNiv2008.pdf Reinforcement learning: The Good, The Bad and The Ugly].) The authors' experiments indicate that this technique is difficult to extend to goal-conditioned settings with image inputs. There are currently no examples that use model-free reinforcement learning to train policies on real-world robotic systems without ground-truth information.
In this paper, the authors utilize a goal-conditioned value function to tackle more general tasks through goal relabelling, which improves sample efficiency. Goal relabelling means retroactively relabelling samples in the replay buffer with goals sampled from the latent representation. The paper samples random goals from the learned latent space to use as replay goals for off-policy Q-learning, rather than restricting them to states seen along the sampled trajectory as was done in earlier works. Specifically, they use a model-free Q-learning method that operates on raw state observations and actions. This approach allows a single transition tuple to be converted into potentially infinitely many valid training examples.
Unsupervised learning has been used in a number of prior works to acquire better representations for reinforcement learning. In these methods, the learned representation is used as a substitute for the state as the input to the policy. However, these methods require additional information, such as access to the ground truth reward function based on the true state during training time [5], expert trajectories [6], human demonstrations [7], or pre-trained object-detection features [8]. In contrast, the authors learn to generate goals and use the learned representation to get a reward function for those goals without any of these extra sources of supervision.


=Goal-Conditioned Reinforcement Learning=


The ultimate directive in reinforcement learning is to learn a policy that, when given a state and goal, can dictate the optimal action. In this paper, goals are not explicitly defined during training. If a goal is not explicitly defined, a set of synthetic goals must be autogenerated by the agent. Thus, suppose we let an autonomous agent explore an environment with a random policy. After executing each action, state observations are collected and stored. These state observations are structured in the form of images. The agent can randomly select goals from the set of state observations, and can also randomly select initial states from the set of state observations.
The ultimate goal in reinforcement learning is to learn a policy <math>\pi</math> that, when given a state <math>s_t</math> and goal <math>g</math> (desired state), can dictate the optimal action <math>a_t</math>. The optimal action <math>a_t</math> is defined as an action which maximizes the expected return denoted by <math>R_t</math> and defined as <math>R_t = \mathbb{E}[\sum_{i = t}^T\gamma^{(i-t)}r_i]</math>, where <math>r_i = r(s_i, a_i, s_{i+1})</math> is the reward for performing action <math>a_i</math> in the current state <math>s_i</math> and arriving in the next state <math>s_{i+1}</math>, and <math>\gamma</math> is a discount factor which determines the relative importance given to rewards at different times.
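
As a small illustration of the return defined above, the following sketch computes the discounted sum for one sampled reward sequence. The function name <code>discounted_return</code> is hypothetical, and the expectation in <math>R_t</math> would in practice be estimated by averaging this quantity over many sampled trajectories.

```python
def discounted_return(rewards, gamma, t=0):
    """Sum of gamma^(i - t) * r_i for i = t..T, for one sampled reward sequence.

    `rewards` is a list [r_0, ..., r_T]; this is a single-sample estimate of R_t,
    not the expectation itself.
    """
    return sum(gamma ** (i - t) * r for i, r in enumerate(rewards) if i >= t)


if __name__ == "__main__":
    # Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

A smaller <math>\gamma</math> makes later rewards count less, which is the sense in which it sets the relative importance of rewards at different times.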
 
In this paper, goals are not explicitly defined during training. If a goal is not explicitly defined, the agent must be able to generate a set of synthetic goals automatically. Suppose we let an autonomous agent explore an environment with a random policy. After executing each action, start and stop state observations are collected and stored. All state observations are images. For training, the agent can randomly select starting states and goal images from the set of state observations.
 
Moreover, if we aim to accomplish a variety of tasks, we can construct a goal-conditioned policy and reward, and optimize the expected return with respect to a goal distribution


<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:human-giving-goal.png|400px]]</div>
<center><math>E_{g \sim G}[E_{r_i,s_i \sim E, a_i \sim \pi}[R_0]]</math></center>


Now given a set of all possible states, a goal, and an initial state, a reinforcement learning framework can be used to find the optimal policy such that the value function is maximized. However, to implement such a framework, we need to define a reward function. One choice for the reward is the negative distance between the current state and the goal state, so that maximizing the reward corresponds to minimizing the distance to a goal state.
where <math>G</math> is the set of goals and the reward is also a function of <math>g</math>


We can train a single policy to maximize rewards and therefore reach goal states by first learning a goal-conditioned Q function. A goal-conditioned Q function Q(s,a,g) tells us how good an action a is, given the current state s and goal g. For example, a Q function tells us, “How good is it to move my hand up (action a), if I’m holding a plate (state s) and want to put the plate on the table (goal g)?” Once this Q function is trained, you can extract a goal-conditioned policy by performing the following optimization
Now given a set of all possible states, a goal, and an initial state, a reinforcement learning framework can be used to find the optimal policy such that a chosen value function is maximized. However, to implement such a framework, a reward function needs to be defined. One choice for the reward is the negative distance between the current state and the goal state, so that maximizing the reward corresponds to minimizing the distance to the goal state.


<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:policy-extraction.png|400px]]</div>
[[File:human-giving-goal.png|center|thumb|400px|The task: Make the world look like this image. [9]]]


which effectively says, “choose the best action according to this Q function.” By using this procedure, we obtain a policy that maximizes the sum of rewards, i.e. reaches various goals.
In reinforcement learning, a goal-conditioned Q-function can be used to find a single policy to maximize rewards and therefore reach goal states. A goal-conditioned Q-function <math>Q(s,a,g)</math> tells us how good an action <math>a</math> is, given the current state <math>s</math> and goal <math>g</math>. For example, a Q-function tells us, “How good is it to move my hand up (action <math>a</math>), if I’m holding a plate (state <math>s</math>) and want to put the plate on the table (goal <math>g</math>)?” Once this Q-function is trained, a goal-conditioned policy can be obtained by performing the following optimization


One reason that Q learning is popular is that it can be done in an off-policy manner, meaning that the only things we need to train our Q function are samples of state, action, next state, goal, and reward: (s,a,s′,g,r). This data can be collected by any policy and can be reused across multiple tasks. So a simple goal-conditioned Q-learning algorithm looks like this:
<div align="center">
<math>\pi(s,g) = \arg\max_a Q(s,a,g)</math>
</div>


<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:ql.png|400px]]</div>
which effectively says, “choose the best action according to this Q-function.” By using this procedure, one can obtain a policy that maximizes the sum of rewards, i.e. reaches various goals.
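
A minimal sketch of this policy extraction, assuming a small discrete action set so that the maximization can be done by enumeration (the paper's robot tasks use continuous actions, where the maximization is handled differently). The names <code>greedy_action</code> and <code>toy_q</code> are hypothetical and only serve the illustration.

```python
import numpy as np


def greedy_action(q_fn, state, goal, actions):
    """Return the action in `actions` with the highest Q(state, action, goal)."""
    values = [q_fn(state, a, goal) for a in actions]
    return actions[int(np.argmax(values))]


def toy_q(state, action, goal):
    """Hypothetical Q-function on a 1-D line: higher when state + action is nearer the goal."""
    return -abs((state + action) - goal)


if __name__ == "__main__":
    # From state 0 with goal 3, stepping right (+1) scores best.
    print(greedy_action(toy_q, state=0, goal=3, actions=[-1, 0, 1]))
```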


The main bottleneck in this training procedure is collecting data. If we could artificially generate more data, we could in theory learn to solve various tasks without even interacting with the world. Unfortunately, learning an accurate model of the world is difficult, so we usually have to rely on sampling to get state-action-next-state data, (s,a,s′). However, if we have access to the reward function r(s,g), we can retroactively relabel goals and recompute rewards, allowing us to artificially generate more data given a single (s,a,s′) tuple. So, we can modify this training procedure like so:
One reason why Q-learning is popular is that it can be trained in an off-policy manner. Therefore, the only things needed to train a Q-function are samples of state, action, next state, goal, and reward <math>(s,a,s',g,r)</math>. This data can be collected by any policy and can be reused across multiple tasks. So a preliminary goal-conditioned Q-learning algorithm looks like this:


<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:qlr.png|400px]]</div>
[[File:ql.png|center|600px]]


The nice thing about this goal resampling is that we can simultaneously learn how to reach multiple goals at once without needing more data from the environment. Overall, this simple modification can result in substantially faster learning.
From the tuple <math>(s,a,s',g,r)</math>, an approximate Q-function parameterized by <math>w</math> can be trained by minimizing the Bellman error:


The method outlined above makes two major assumptions: (1) you have access to a reward function and (2) you have access to a goal sampling distribution p(g). Prior works that use this goal relabeling strategy ( Kaelbling ‘93 , Andrychowicz ‘17 , Pong ‘18 ) operate on ground truth state information (e.g., the Cartesian position of an object), where it is easy to manually design both the goal distribution p(g) and reward function. However, when moving to vision-based tasks where goals are images, both of these assumptions introduce practical concerns.
<div align="center">
<math>\mathcal{E}(w) = \frac{1}{2} || Q_w(s,a,g) -(r + \gamma \max_{a'} Q_{\overline{w}}(s',a',g)) ||^2 </math>
</div>


For one, a fundamental problem with this reward function is that it assumes that the distance between raw images will yield semantically useful information. Images are noisy, and a large amount of the information in an image may not be related to the object we analyze. Thus, the distance between two images may not correlate with their semantic distance.
where <math>\overline{w}</math> is treated as some constant.
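
The Bellman error above can be sketched for a single transition as follows. This is an illustration under stated assumptions, not the authors' implementation: the action set is assumed to be a small finite list so the inner maximization is an enumeration, and <code>q</code> and <code>q_target</code> are hypothetical callables standing in for <math>Q_w</math> and <math>Q_{\overline{w}}</math>.

```python
def bellman_error(q, q_target, s, a, s_next, g, r, actions, gamma=0.99):
    """0.5 * (Q_w(s, a, g) - (r + gamma * max_a' Q_wbar(s', a', g)))^2 for one transition.

    `q_target` plays the role of Q with the constant parameters w-bar, so no
    gradient would flow through the target term.
    """
    target = r + gamma * max(q_target(s_next, a_next, g) for a_next in actions)
    return 0.5 * (q(s, a, g) - target) ** 2


if __name__ == "__main__":
    # Hypothetical constant Q-functions, used only to exercise the formula.
    q = lambda s, a, g: 0.0
    q_target = lambda s, a, g: 1.0
    print(bellman_error(q, q_target, s=0, a=1, s_next=1, g=3, r=1.0,
                        actions=[-1, 0, 1], gamma=0.5))
```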


Second, because our goals are images, we need a goal image distribution p(g) from which we can sample goal images. Manually designing a distribution over goal images is a non-trivial task and image generation is still an active field of research. Instead, we would like our agent to autonomously imagine its own goals and learn how to reach them.
The main bottleneck in this training procedure is data collection. In theory, if more data could be generated artificially, one could learn to solve various tasks without even interacting with the world. Unfortunately, it is difficult to learn an accurate model of the world, so sampling is usually performed to get state-action-next-state data, <math>(s,a,s')</math>. However, if the reward function <math>r(s,g)</math> can be accessed, one can retroactively relabel goals and recompute rewards. This way, more data can be artificially generated given a single <math>(s,a,s')</math> tuple. As a result, the training procedure can be modified like so:


=Variational Autoencoder (VAE)=
[[File:qlr.png|center|600px]]
An autoencoder is a type of machine learning model that can learn to extract a robust, space-efficient feature vector from an image. The model has two parts: an encoder (e) and a decoder (p). The encoder takes the image as input and outputs a low-dimensional feature vector. The decoder takes this low-dimensional feature vector as input and reconstructs the original image.


<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:1 1OabPemOWCLrpCwIUmIsCg.png|400px]]</div>
This goal resampling makes it possible to simultaneously learn how to reach multiple goals at once without needing more data from the environment. Thus, this simple modification can result in substantially faster learning. However, the method described above makes two major assumptions: (1) you have access to a reward function and (2) you have access to a goal sampling distribution <math>p(g)</math>. When moving to vision-based tasks where goals are images, both of these assumptions introduce practical concerns, as the task of generating goal images is fairly intensive.
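
The goal relabeling step discussed above can be sketched as follows. This is a minimal sketch under stated assumptions: <code>relabel</code>, <code>sample_goal</code>, and <code>reward_fn</code> are hypothetical names, the goal sampler stands in for <math>p(g)</math>, and the reward is recomputed on the next state, which is one reasonable reading of <math>r(s,g)</math>.

```python
import random


def relabel(transitions, sample_goal, reward_fn, rng=random):
    """Replace the goal of each (s, a, s', g, r) tuple with a freshly sampled goal.

    The reward is recomputed for the new goal, so a single (s, a, s') triple
    can be reused for many different goals.
    """
    relabeled = []
    for s, a, s_next, _old_goal, _old_reward in transitions:
        g = sample_goal(rng)
        relabeled.append((s, a, s_next, g, reward_fn(s_next, g)))
    return relabeled


if __name__ == "__main__":
    buffer = [(0.0, 1.0, 1.0, 5.0, -4.0)]            # one stored transition
    uniform_goal = lambda rng: rng.uniform(-5, 5)    # hypothetical p(g)
    neg_distance = lambda s, g: -abs(s - g)          # negative-distance reward
    print(relabel(buffer, uniform_goal, neg_distance))
```

Calling <code>relabel</code> repeatedly on the same buffer yields different training tuples each time, which is where the gain in sample efficiency comes from.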


This generative model converts high-dimensional observations x, like images, into low-dimensional latent variables z, and vice versa. The model is trained so that the latent variables capture the underlying factors of variation in an image, similar to the abstract representations a human may use to interpret the world and goals. Given a current image x and goal image xg, we convert them into latent variables z and zg respectively. We then use these latent variables to represent the state and goal for our reinforcement learning algorithm. Learning Q functions and policies on top of this low-dimensional latent space rather than directly on images results in faster learning.
For one, a fundamental problem with this reward function is that it assumes that the distance between raw images will yield semantically useful information. But images are noisy and a large amount of information in an image may not be related to the object we analyze. Thus, the distance between the two images may not correlate with their semantic distance.


<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:robot-interpreting-scene.png|400px]]</div>
Second, because the goals are images, a goal image distribution <math>p(g)</math> is needed so that one can sample goal images. Manually designing a distribution over goal images is a non-trivial task and image generation is still an active field of research. It would be ideal if the agent can autonomously imagine its own goals and learn how to reach them.


Using the latent variable representations for the images and goals also solves another problem: how to compute rewards. Rather than using pixel-wise error as our reward, we use the distance in the latent space for the reward to train our agent to reach a goal. In the full research paper describing our method, we show that this corresponds to maximizing the probability of reaching the goal and provides a much more effective learning signal.
Retroactively generating goals is also explored in tabular domains in [15] and in continuous domains in [14] using hindsight experience replay (HER). However, HER is limited to sampling goals seen along a trajectory, which greatly limits the number and diversity of goals with which one can relabel a given transition.


This generative model is also important because it allows an agent to easily generate goals in the latent space. In particular, our generative model is designed so that sampling latent variables is trivial: we just sample latents from the VAE prior. We use this sampling mechanism for two reasons: First, it provides a mechanism for an agent to set its own goals. The agent simply samples a value for the latent variable from our generative model, and tries to reach that latent goal. Second, this resampling mechanism is also used to relabel goals as mentioned above. Because our generative model is trained to encode real images into the prior, the samples from our latent variable prior correspond to meaningful latent goals.
=Variational Autoencoder=
Variational autoencoders can learn structured latent representations of high-dimensional data. A VAE contains an encoder <math>q_\phi</math> and a decoder <math>p_\psi</math>. The former maps states to latent distributions, while the latter maps latents to distributions over states. These two are jointly trained to maximize:


<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:robot-imagining-goals.png|400px]]</div>
<math>\mathcal{L}(\psi,\phi;s^{(i)}) = -\beta D_{KL}(q_\phi(z|s^{(i)}) \,||\, p(z)) + \mathbb{E}_{q_\phi(z|s^{(i)})}[\log p_\psi(s^{(i)}|z)]</math>


All together, the latent variable representation of images (1) captures the underlying factors of a scene, (2) provides meaningful distances to optimize, and (3) provides an efficient goal sampling mechanism, allowing us to efficiently train a goal-conditioned reinforcement learning agent that operates directly on pixels. We call the overall method reinforcement learning with imagined goals (RIG).
where p(z) is a prior distribution, which is chosen to be unit Gaussian, <math>D_{KL}</math> is the Kullback-Leibler divergence, and <math>\beta</math> is a hyper-parameter that balances the two terms.
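
To make the two terms of this objective concrete, the sketch below evaluates it for a diagonal Gaussian encoder, for which the KL divergence to a unit Gaussian prior has a closed form. The diagonal Gaussian parameterization (<code>mu</code>, <code>log_var</code>) and the function names are assumptions made for illustration; the reconstruction log-likelihood is passed in as a number rather than computed from a decoder.

```python
import numpy as np


def gaussian_kl_to_unit(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))


def beta_vae_objective(mu, log_var, recon_log_prob, beta):
    """-beta * KL(q(z|s) || p(z)) + E_q[log p(s|z)], with the expectation supplied
    as a (single-sample or averaged) estimate `recon_log_prob`."""
    return -beta * gaussian_kl_to_unit(mu, log_var) + recon_log_prob


if __name__ == "__main__":
    # An encoder output equal to the prior incurs no KL penalty.
    print(beta_vae_objective(np.zeros(2), np.zeros(2), recon_log_prob=-3.0, beta=5.0))
```

Increasing <code>beta</code> pushes the encoder distribution towards the prior at the cost of reconstruction quality, which is the trade-off the hyper-parameter balances.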


=Experiments=
This generative model converts high-dimensional observations <math>x</math>, like images, into low-dimensional latent variables <math>z</math>, and vice versa. The model is trained so that the latent variables capture the underlying factors of variation in an image. A current image <math>x</math> and goal image <math>x_g</math> can be converted into latent variables <math>z</math> and <math>z_g</math>, respectively. These latent variables can then be used to represent the state and goal for the reinforcement learning algorithm. Learning Q functions and policies on top of this low-dimensional latent space rather than directly on images results in faster learning.


We conducted experiments to test if RIG would be sample-efficient enough to train a real-world robot policy in a reasonable amount of time. We tested the robot’s ability to reach user-specified positions and push objects to desired locations, as indicated by a goal image. The robot is trained with access only to 84x84 RGB images and without access to joint angles or object positions. The robot first learns by setting its own goals in the latent space. By setting its own goals, the robot can autonomously practice reaching different positions without human involvement. The only human involvement is when a person wants the robot to perform a specific task. At this point, the robot is given a goal image. Because the robot has practiced reaching so many goals, we see that it is able to reach this goal without additional training:
[[File:robot-interpreting-scene.png|center|thumb|600px|The agent encodes the current image (<math>x</math>) and goal image (<math>x_g</math>) into a latent space and uses distances in that latent space for reward. [9]]]


<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:reaching.JPG|400px]]</div>
Using the latent variable representations for the images and goals also solves the problem of computing rewards. Instead of using pixel-wise error as our reward, the distance in the latent space is used as the reward to train the agent to reach a goal. The paper shows that this corresponds to rewarding reaching states that maximize the probability of the latent goal <math>z_g</math>.
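
A minimal sketch of a latent-distance reward, assuming the current image and the goal image have already been encoded into vectors <code>z</code> and <code>z_goal</code>. The function name and the optional weighting matrix <code>A</code> (giving a Mahalanobis-style distance, in the spirit of the <math>||\cdot||_A</math> notation used elsewhere on this page) are illustrative assumptions; <code>A</code> is assumed to be symmetric positive semi-definite.

```python
import numpy as np


def latent_reward(z, z_goal, A=None):
    """Negative distance between latent state and latent goal.

    With A=None this is the negative Euclidean distance; otherwise it is
    -sqrt((z - z_goal)^T A (z - z_goal)).
    """
    diff = np.asarray(z, dtype=float) - np.asarray(z_goal, dtype=float)
    if A is None:
        return -float(np.linalg.norm(diff))
    return -float(np.sqrt(diff @ np.asarray(A, dtype=float) @ diff))


if __name__ == "__main__":
    print(latent_reward([0.0, 0.0], [3.0, 4.0]))                  # Euclidean case
    print(latent_reward([0.0, 0.0], [3.0, 4.0], A=4 * np.eye(2)))  # weighted case
```

The reward is zero exactly when the encoded state matches the latent goal and becomes more negative as the two move apart, so maximizing it drives the agent towards the goal in latent space.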


We also used RIG to train a policy to push objects to target locations:
This generative model is also important because it allows an agent to easily generate goals in the latent space. In particular, the authors design the generative model so that latent variables can be sampled directly from the VAE prior. This sampling mechanism is used for two reasons: First, it provides a mechanism for an agent to set its own goals. The agent simply samples a value for the latent variable from the generative model and tries to reach that latent goal. Second, this resampling mechanism is also used to relabel goals as mentioned above. Since the VAE is trained to encode real images into the prior, samples from the latent variable prior correspond to meaningful latent goals. This helps the agent set its own goals and practice towards them if no goal is provided at test time.


<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:pushing.JPG|400px]]</div>
[[File:robot-imagining-goals.png|center|thumb|600px|Even without a human providing a goal, our agent can still generate its own goals, both for exploration and for goal relabeling. [9]]]


Training a policy directly from images makes it easy to change tasks from reaching to object pushing. We simply added an object, added a table, and adjusted the camera. Lastly, despite working directly from pixels, these experiments did not take long to run. The reaching results took about an hour, while the pushing results took about 4.5 hours of real-robot interaction time. Many real-world robot reinforcement learning results use ground-truth state information like the position of an object. However, this usually requires additional machinery, like purchasing and setting up extra sensors or training an object-detection system. In contrast, our method only requires an RGB camera and works directly from the images.
The authors summarize the purpose of the latent variable representation of images as follows: (1) captures the underlying factors of a scene, (2) provides meaningful distances to optimize, and (3) provides an efficient goal sampling mechanism which can be used by the agent to generate its own goals. The overall method is called reinforcement learning with imagined goals (RIG) by the authors.
The process starts with collecting data through a simple exploration policy. Alternative exploration strategies could be employed here, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, a VAE latent variable model is trained on state observations and fine-tuned during training. The latent variable model is used for multiple purposes: sampling a latent goal <math>z_g</math> from the model and conditioning the policy on this goal. All states and goals are embedded using the model’s encoder and then used to train the goal-conditioned value function. The authors then resample goals from the prior and compute rewards in the latent space.
=Goal-Conditioned Policies with Unsupervised Representation Learning=
Devising practical goal-conditioned value functions requires choosing a suitable goal representation. In the absence of domain-specific knowledge and instrumentation, one choice is to set the goal space G to be the same as the state observation space S. However, when the state is high-dimensional, learning a goal-conditioned Q-function and policy becomes exceedingly difficult. One challenging problem with end-to-end approaches for visual RL tasks is that the resulting policy needs to learn both perception and control. Training the goal-conditioned value function also requires defining a goal-conditioned reward.


=Performance Evaluation Strategy=
Their method jointly addresses a number of problems that arise when working with high-dimensional inputs such as images: sample efficient learning, reward specification, and automated goal-setting. These problems are addressed by learning a latent embedding using a <math>\beta</math>-VAE. This latent space is then used to represent the goal and state and retroactively relabel data with latent goals sampled from the VAE prior to improve sample efficiency.
=Algorithm=
[[File:algorithm1.png|center|thumb|600px|]]


The evaluation in this particular task is very tricky. The reason is that we are evaluating a neural network here, and in order to evaluate it, we need to train it first. Because we are doing pixel-level classification on high-resolution images, the naive approach would require a tremendous amount of computational resources.
Algorithm 1 is called reinforcement learning with imagined goals (RIG). The data is first collected via a simple exploration policy. The proposed model allows for alternate exploration policies to be used which include off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, the authors train a VAE latent variable model on state observations and finetune it over the course of training. VAE latent space modeling is used to allow the conditioning of policy on the goal which is sampled from the latent model. The VAE model is also used to encode all the goals and the states. When the goal-conditioned value function is trained, the authors resample prior goals and compute rewards in the latent space using the equation 


The way they solve it in the paper is by defining a proxy task. The proxy task is a task that requires significantly fewer computational resources, while still giving a good estimate of the performance of the network. In most image classification tasks of NAS, the proxy
<center><math display="inline"> r(s, g) = - || z - z_g ||_A \propto \sqrt{\log e_{\Phi}(z_g | s)} </math></center>
task is to train the network on images of lower resolution. The assumption is that if the network performs well on images with lower resolution, it should perform reasonably well on images with higher resolution.


However, the above approach does not work in this case. The reason is that dense prediction tasks innately require high-resolution images as training data. The approach used in the paper is the following:
This equation is derived from the equation below. This is based on the choice to use the negative Mahalanobis distance in the latent space for the reward:
<ol>
<li> Use a smaller backbone for the proxy task.</li>
<li> Cache the feature maps produced by the network backbone on the training set and build a single DPC directly on top of them.</li>
<li> Stop early: train for 30k iterations with a batch size of 8.</li>
</ol>


If they trained on the large-scale backbone without fixing the weights of the backbone, they would need one week to train a network on a P100 GPU, but the proxy task cuts this down to 90 minutes. They then rank the selected architectures, choose the top 50, and do
<center><math display="inline"> r(s, g) = - || e(s) - e(g) ||_A = - || z - z_g ||_A </math></center>
a full evaluation on it.


The evaluation metric they used is called mIOU, the pixel-level intersection over union. This is just the area of the intersection
=Experiments=
of the ground truth and the prediction over the area of the union of the ground truth and the prediction.
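
As an illustration of this metric, the sketch below computes a mean intersection over union from integer label maps. It assumes per-pixel class labels in <code>0..num_classes-1</code> and averages only over classes that appear in the prediction or the ground truth; the function name <code>mean_iou</code> is hypothetical and benchmark implementations may differ in such details.

```python
import numpy as np


def mean_iou(pred, truth, num_classes):
    """Mean over classes of |pred == c AND truth == c| / |pred == c OR truth == c|."""
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    ious = []
    for c in range(num_classes):
        intersection = np.logical_and(pred == c, truth == c).sum()
        union = np.logical_or(pred == c, truth == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection / union)
    return float(np.mean(ious))


if __name__ == "__main__":
    # Class 0: 1/2, class 1: 2/3
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```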


=Result=
The authors evaluated their method against some prior algorithms and ablated versions of their approach on a suite of simulated and real-world tasks: Visual Reacher, Visual Pusher, and Visual Multi-Object Pusher. They compared their model with the following prior works: L&R, DSAE, HER, and Oracle. It is concluded that their approach substantially outperforms the previous methods and is close to the state-based "oracle" method in terms of efficiency and performance.


This method achieves state-of-the-art performance on many datasets. The following tables quantify the performance gains.
The figure below shows the performance of different algorithms on this task. This involved a simulated environment with a Sawyer arm. The authors' algorithm was given only visual input, and the available controls were end-effector velocity. The plots show the distance to the goal state as a function of simulation steps. The Oracle, as a baseline, was given true object location information, as opposed to visual pixel information.


<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-10 at 6.51.14 PM.png| 800px]]
[[File:WF_Sec_11Nov_25_02.png|1000px]]
</div>
They chose to train on a modified Xception network as a backbone, and the following is the resulting architecture for the DPC.


<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">[[File:Screen Shot 2018-11-12 at 12.32.05 PM.png|1000px]]
</div>


They then investigated the effectiveness of distances in the VAE latent space for the Visual Pusher task. They observed that latent distance significantly outperforms the log probability and pixel mean-squared error. The resampling strategies are also varied while fixing other components of the algorithm to study the effect of relabeling strategy. In this experiment, the RIG, which is an equal mixture of the VAE and Future sampling strategies, performs best. Subsequently, learning with variable numbers of objects was studied by evaluating on a task where the environment, based on the Visual Multi-Object Pusher, randomly contains zero, one, or two objects during testing. The results show that their model can tackle this task successfully.


Finally, the authors tested RIG on a real-world robot for its ability to reach user-specified positions and push objects to desired locations, as indicated by a goal image. The robot is trained with access only to 84x84 RGB images, without access to joint angles or object positions. The robot first learns by setting its own goals in the latent space and autonomously practicing reaching different positions without human involvement. After this period of autonomous practice, the robot is given a goal image. Because the robot has practiced reaching so many goals, it is able to reach this goal without additional training:


[[File:reaching.JPG|center|thumb|600px|(Left) The robot setup is pictured. (Right) Test rollouts of the learned policy.]]


The method for reaching only needs 10,000 samples and an hour of real-world interactions.


They also used RIG to train a policy to push objects to target locations:
[[File:pushing.JPG|center|thumb|600px|The robot pushing setup is pictured, with frames from test rollouts of the learned policy.]]


The pushing task is more complicated, and the method requires about 25,000 samples. Since the authors do not have access to the true object position during training, they report test episode returns measured with the VAE latent distance reward. As learning proceeds, RIG makes steady progress at optimizing the latent distance.


=Conclusion & Future Work=


In this paper, a new RL algorithm is proposed to efficiently solve goal-conditioned, vision-based tasks without any ground truth state information or reward functions. The authors suggest that, instead of goal images, one could use other representations, such as language and demonstrations, to specify goals. Also, while the paper provides a mechanism to sample goals for autonomous exploration, one can combine the proposed method with existing work by choosing these goals in a more principled way, i.e. a procedure that is not only goal-oriented, but also information seeking or uncertainty aware, to perform even better exploration. Furthermore, combining the idea of this paper with methods from multitask learning and meta-learning is a promising path to creating general-purpose agents that can continuously and efficiently acquire skills. Lastly, there are a variety of robot tasks whose state representation would be difficult to capture with sensors, such as manipulating deformable objects or handling scenes with a variable number of objects. It would be interesting to see whether RIG can be scaled up to solve these tasks. A more recent paper [10] builds on the framework of goal-conditioned reinforcement learning to extract state representations based on the actions required to reach them, abbreviated ARC for actionable representations for control.


=Critique=
1. This paper is novel because it uses visual data and trains in an unsupervised fashion. The algorithm has no access to a ground truth state or to a pre-defined reward function. It can perform well in a real-world environment with no explicit programming.


2. From the videos, one major concern is that the position of the robotic arm is not stable during training and test time. It is likely that the encoder compresses the image features so much that the images in the latent space are too blurry to be used as goal images. It would be worth investigating this in the future. It would also be worth investigating a method that uses multiple data sources, where the agent is trained to choose the source that has more complete information.


3. The algorithm seems to perform better when there is only one object in the images. For example, in the Visual Multi-Object Pusher experiment, the relative positions of the two pucks do not correspond well with their relative positions in the goal images. The same is observed in the variable-object experiment. One may guess that the more information an image contains, the less likely the robot is to perform well. This limits the applicability of the current algorithm to real-world problems.


4. The instability mentioned in #2 is even more apparent in the multi-object scenario and appears to result from the model attempting to optimize the positions of both objects at the same time. Reducing the problem to a sequence of single-object targets may reduce the amount of time the robot spends moving between the multiple objects in the scene (which it currently does quite frequently).

=References=
1. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

2. Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to Poke by Poking: Experiential Learning of Intuitive Physics. In Advances in Neural Information Processing Systems (NIPS), 2016.

3. Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-Shot Visual Imitation. In International Conference on Learning Representations (ICLR), 2018.

4. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

5. Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning (ICML), 2017.

6. Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal Planning Networks. In International Conference on Machine Learning (ICML), 2018.

7. Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888, 2017.

8. Alex Lee, Sergey Levine, and Pieter Abbeel. Learning Visual Servoing with Deep Features and Fitted Q-Iteration. In International Conference on Learning Representations (ICLR), 2017.

9. Online source: https://bair.berkeley.edu/blog/2018/09/06/rig/

10. https://arxiv.org/pdf/1811.07819.pdf

11. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research (JMLR), 17(1):1334–1373, 2016. ISSN 15337928.

12. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

13. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

14. Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay. In Advances in Neural Information Processing Systems (NIPS), 2017.

15. L P Kaelbling. Learning to achieve goals. In IJCAI-93: Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, volume 2, pages 1094–1098, 1993.

Latest revision as of 21:47, 11 December 2018

Video and details of this work are available here

Introduction and Motivation

Humans are able to accomplish many tasks without any explicit or supervised training, simply by exploring their environment. We are able to set our own goals and learn from our experiences, and thus are able to accomplish specific tasks without ever having been trained explicitly for them. It would be ideal if an autonomous agent could also set its own goals and learn from its environment.

In the paper “Visual Reinforcement Learning with Imagined Goals”, the authors are able to devise such an unsupervised reinforcement learning system. They introduce a system that sets abstract (self-generated) goals and autonomously learns to achieve those goals. They then show that the system can use these autonomously learned skills to perform a variety of user-specified goals, such as pushing objects, grasping objects, and opening doors, without any additional learning. Lastly, they demonstrate that their method is efficient enough to work in the real world on a Sawyer robot. The robot learns to set and achieve goals with only images as the input to the system.

The algorithm proposed by the authors is summarized below. A variational autoencoder (VAE) (left) learns a latent representation of images gathered during training time (center). These latent variables are used to train a policy on imagined goals (center), which can then be used for accomplishing user-specified goals (right).

Related Work

Many previous works on vision-based deep reinforcement learning for robotics studied a variety of behaviors such as grasping [1], pushing [2], navigation [3], and other manipulation tasks [4]. However, their modeling assumptions limit their suitability for training general-purpose robots. Some previous works, such as Levine et al. [11], proposed time-varying models, which require episodic setups and are thus hard to generalize to non-episodic and continuous learning scenarios. Other works, such as Pinto et al. [12], proposed an approach using goal images, but it requires instrumented training simulations. Lillicrap et al. [13] use fully model-free training, but do not learn goal-conditioned skills, and the authors' experiments indicate that this technique is difficult to extend to the goal-conditioned setting with image inputs. (Model-based RL uses experience to construct an internal model of the transitions and immediate outcomes in the environment, and chooses actions by searching or planning in this world model. Model-free RL, on the other hand, uses experience to directly learn one or both of two simpler quantities, state/action values or policies, which can achieve the same optimal behavior without estimating or using a world model. Given a policy, a state has a value, defined in terms of the future utility that is expected to accrue starting from that state; see "Reinforcement learning: The Good, The Bad and The Ugly".) There are currently no examples of model-free reinforcement learning being used to learn policies on real-world robotic systems without ground-truth information.

In this paper, the authors utilize a goal-conditioned value function to tackle more general tasks through goal relabelling, which improves sample efficiency. Goal relabelling retroactively relabels samples in the replay buffer with goals sampled from the latent representation. The paper samples random goals from the learned latent space to use as replay goals for off-policy Q-learning, rather than restricting replay goals to states seen along the sampled trajectory as was done in earlier works. Specifically, they use a model-free Q-learning method that operates on raw state observations and actions. This approach allows a single transition tuple to be converted into potentially infinitely many valid training examples.

Unsupervised learning has been used in a number of prior works to acquire better representations for reinforcement learning. In these methods, the learned representation is used as a substitute for the state as input to the policy. However, these methods require additional information, such as access to the ground truth reward function based on the true state during training time [5], expert trajectories [6], human demonstrations [7], or pre-trained object-detection features [8]. In contrast, the authors learn to generate goals and use the learned representation to get a reward function for those goals without any of these extra sources of supervision.

Goal-Conditioned Reinforcement Learning

The ultimate goal in reinforcement learning is to learn a policy [math]\displaystyle{ \pi }[/math] that, when given a state [math]\displaystyle{ s_t }[/math] and goal [math]\displaystyle{ g }[/math] (desired state), dictates the optimal action [math]\displaystyle{ a_t }[/math]. The optimal action [math]\displaystyle{ a_t }[/math] is defined as an action which maximizes the expected return, denoted by [math]\displaystyle{ R_t }[/math] and defined as [math]\displaystyle{ R_t = \mathbb{E}[\sum_{i = t}^T\gamma^{(i-t)}r_i] }[/math], where [math]\displaystyle{ r_i = r(s_i, a_i, s_{i+1}) }[/math] is the reward for performing action [math]\displaystyle{ a_i }[/math] when the current state is [math]\displaystyle{ s_i }[/math] and the next state is [math]\displaystyle{ s_{i+1} }[/math], and [math]\displaystyle{ \gamma }[/math] is a discount factor which determines the relative importance given to rewards at different times.
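
As a minimal sketch of the return defined above, the following Python snippet computes the discounted sum for one sampled trajectory; the expectation would be estimated by averaging this quantity over many trajectories. The function name `discounted_return` and the example rewards are illustrative assumptions and do not come from the paper.

```python
def discounted_return(rewards, gamma, t=0):
    """Sum of gamma^(i - t) * r_i for i = t..T over one sampled trajectory.

    `rewards` is the list [r_0, ..., r_T]; all names here are illustrative.
    """
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))


# Hypothetical rewards for a three-step episode.
example_rewards = [1.0, 1.0, 1.0]
print(discounted_return(example_rewards, gamma=0.5))       # 1.0 + 0.5 + 0.25 = 1.75
print(discounted_return(example_rewards, gamma=0.5, t=1))  # 1.0 + 0.5 = 1.5
```

A smaller discount factor makes later rewards count less, which is why the same reward sequence yields a smaller return when it is discounted more heavily.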

In this paper, goals are not explicitly defined during training. If a goal is not explicitly defined, the agent must be able to generate a set of synthetic goals automatically. Suppose we let an autonomous agent explore an environment with a random policy. After executing each action, start and stop state observations are collected and stored. All state observations are images. For training, the agent can randomly select starting states and goal images from the set of state observations.

Moreover, if we aim to accomplish a variety of tasks, we can construct a goal-conditioned policy and reward, and optimize the expected return with respect to a goal distribution

[math]\displaystyle{ E_{g \sim G}[E_{r_i,s_i \sim E, a_i \sim \pi}[R_0]] }[/math]

where [math]\displaystyle{ G }[/math] is the goal distribution and the reward is also a function of [math]\displaystyle{ g }[/math].

Now given a set of all possible states, a goal, and an initial state, a reinforcement learning framework can be used to find the optimal policy such that a chosen value function is maximized. However, to implement such a framework, a reward function needs to be defined. One choice for the reward is the negative distance between the current state and the goal state, so that maximizing the reward corresponds to minimizing the distance to the goal state.

The task: Make the world look like this image. [9]

In reinforcement learning, a goal-conditioned Q-function can be used to find a single policy to maximize rewards and therefore reach goal states. A goal-conditioned Q-function [math]\displaystyle{ Q(s,a,g) }[/math] tells us how good an action [math]\displaystyle{ a }[/math] is, given the current state [math]\displaystyle{ s }[/math] and goal [math]\displaystyle{ g }[/math]. For example, a Q-function tells us, “How good is it to move my hand up (action [math]\displaystyle{ a }[/math]), if I’m holding a plate (state [math]\displaystyle{ s }[/math]) and want to put the plate on the table (goal [math]\displaystyle{ g }[/math])?” Once this Q-function is trained, a goal-conditioned policy can be obtained by performing the following optimization

[math]\displaystyle{ \pi(s,g) = \arg\max_a Q(s,a,g) }[/math]

which effectively says, “choose the best action according to this Q-function.” By using this procedure, one can obtain a policy that maximizes the sum of rewards, i.e. reaches various goals.
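
The following sketch illustrates extracting a greedy policy from a goal-conditioned Q-function. It assumes a small discrete action set so that the maximization can be done by enumeration; with continuous actions, as on a robot arm, the maximization would have to be approximated instead. The toy Q-function `toy_q` and the one-dimensional state are hypothetical and serve only as an illustration.

```python
def greedy_action(q_function, state, goal, actions):
    """Return the action in `actions` with the highest Q(s, a, g)."""
    return max(actions, key=lambda a: q_function(state, a, goal))


def toy_q(state, action, goal):
    """Hypothetical Q-function on a 1-D line: prefer moves that end closer to the goal."""
    return -abs((state + action) - goal)


actions = [-1, 0, 1]
print(greedy_action(toy_q, state=2, goal=5, actions=actions))  # 1 (move toward the goal)
print(greedy_action(toy_q, state=5, goal=5, actions=actions))  # 0 (already at the goal)
```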

Q-learning is popular because it can be trained in an off-policy manner. The only things a Q-function needs are samples of state, action, next state, goal, and reward [math]\displaystyle{ (s,a,s',g,r) }[/math]. This data can be collected by any policy and can be reused across multiple tasks. A preliminary goal-conditioned Q-learning algorithm therefore proceeds as follows.

From the tuple [math]\displaystyle{ (s,a,s',g,r) }[/math], an approximate Q-function parameterized by [math]\displaystyle{ w }[/math] can be trained by minimizing the Bellman error:

[math]\displaystyle{ \mathcal{E}(w) = \frac{1}{2} || Q_w(s,a,g) -(r + \gamma \max_{a'} Q_{\overline{w}}(s',a',g)) ||^2 }[/math]

where [math]\displaystyle{ \overline{w} }[/math] is treated as a constant, i.e. no gradient is taken through the target term.
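
To make the Bellman error concrete, the sketch below evaluates it for a single transition tuple. It assumes a discrete action set (so the inner maximum can be enumerated) and represents the Q-function and its frozen target copy as plain Python callables; these names and the numbers in the example are hypothetical.

```python
def bellman_error(q, q_target, transition, actions, gamma):
    """Squared Bellman error 0.5 * (Q_w(s,a,g) - (r + gamma * max_a' Q_wbar(s',a',g)))^2.

    `q` plays the role of Q_w and `q_target` the role of the frozen Q_wbar.
    """
    s, a, s_next, g, r = transition
    target = r + gamma * max(q_target(s_next, a_next, g) for a_next in actions)
    return 0.5 * (q(s, a, g) - target) ** 2


# Hypothetical example: an untrained Q-function that outputs 0 everywhere.
zero_q = lambda s, a, g: 0.0
transition = (0, 1, 1, 3, -2.0)  # (s, a, s', g, r)
print(bellman_error(zero_q, zero_q, transition, actions=[-1, 0, 1], gamma=0.9))  # 2.0
```

In an actual implementation this quantity would be averaged over a minibatch and minimized with respect to the parameters of `q` only.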

The main drawback of this training procedure is the cost of collecting data. In theory, given an accurate model of the world, one could learn to solve various tasks without even interacting with it. Unfortunately, it is difficult to learn an accurate model of the world, so sampling is usually performed to get state-action-next-state data, [math]\displaystyle{ (s,a,s') }[/math]. However, if the reward function [math]\displaystyle{ r(s,g) }[/math] can be accessed, one can retroactively relabel goals and recompute rewards. This way, more data can be artificially generated from a single [math]\displaystyle{ (s,a,s') }[/math] tuple. As a result, the training procedure can be modified to relabel each stored transition with resampled goals.

This goal resampling makes it possible to learn how to reach multiple goals at once without needing more data from the environment. Thus, this simple modification can result in substantially faster learning. However, the method described above makes two major assumptions: (1) you have access to a reward function and (2) you have access to a goal sampling distribution [math]\displaystyle{ p(g) }[/math]. When moving to vision-based tasks where goals are images, both of these assumptions introduce practical concerns, as the task of generating goal images is fairly intensive.
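
The relabeling step can be sketched as follows: one stored transition is paired with several resampled goals, and the reward is recomputed for each. This is a minimal illustration that assumes the reward is evaluated on the next state and that a goal sampler is available; the function `relabel_transition`, the uniform goal sampler and the scalar states are hypothetical choices.

```python
import random


def relabel_transition(transition, sample_goal, reward_fn, num_goals, rng):
    """Turn one (s, a, s') tuple into `num_goals` tuples (s, a, s', g, r)."""
    s, a, s_next = transition
    relabeled = []
    for _ in range(num_goals):
        g = sample_goal(rng)  # draw a new goal from p(g)
        # Assumption: the reward is recomputed from the next state and the new goal.
        relabeled.append((s, a, s_next, g, reward_fn(s_next, g)))
    return relabeled


rng = random.Random(0)
samples = relabel_transition(
    transition=(0.0, 1.0, 1.0),
    sample_goal=lambda r: r.uniform(-5.0, 5.0),         # hypothetical goal distribution
    reward_fn=lambda state, goal: -abs(state - goal),   # negative distance to the goal
    num_goals=4,
    rng=rng,
)
```

Each element of `samples` is a valid training tuple for the Q-function, even though only one environment step was taken.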

First, a reward based on the distance between raw images assumes that this distance yields semantically useful information. But images are noisy, and a large amount of the information in an image may be unrelated to the object of interest. Thus, the distance between two images may not correlate with their semantic distance.

Second, because the goals are images, a goal image distribution [math]\displaystyle{ p(g) }[/math] is needed so that one can sample goal images. Manually designing a distribution over goal images is a non-trivial task, and image generation is still an active field of research. It would be ideal if the agent could autonomously imagine its own goals and learn how to reach them.

Retroactively generating goals is also explored in tabular domains in [15] and in continuous domains in [14] using hindsight experience replay (HER). However, HER is limited to sampling goals seen along a trajectory, which greatly limits the number and diversity of goals with which one can relabel a given transition.

Variational Autoencoder

Variational autoencoders (VAEs) can learn structured latent representations of high-dimensional data. A VAE contains an encoder [math]\displaystyle{ q_\phi }[/math] and a decoder [math]\displaystyle{ p_\psi }[/math]. The former maps states to latent distributions, while the latter maps latents to distributions over states. The two are jointly trained to maximize:

[math]\displaystyle{ \mathcal{L}(\psi,\phi;s^{(i)}) = -\beta D_{KL}(q_\phi(z|s^{(i)}) \,||\, p(z)) + E_{q_\phi(z|s^{(i)})}[\log p_\psi(s^{(i)}|z)] }[/math]

where p(z) is a prior distribution, which is chosen to be unit Gaussian, [math]\displaystyle{ D_{KL} }[/math] is the Kullback-Leibler divergence, and [math]\displaystyle{ \beta }[/math] is a hyper-parameter that balances the two terms.
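
As a sketch of how this objective can be evaluated, assume the encoder outputs a diagonal Gaussian with mean `mu` and log-variance `log_var`; the KL divergence to the unit Gaussian prior then has the standard closed form used below. The reconstruction term is passed in as a single-sample estimate. The diagonal-Gaussian assumption and all function names are illustrative and are not stated in the summary above.

```python
import math


def kl_to_unit_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def beta_vae_objective(recon_log_prob, mu, log_var, beta):
    """Objective to maximize: E[log p(s|z)] - beta * KL(q(z|s) || p(z)).

    `recon_log_prob` is a Monte Carlo estimate of the reconstruction term.
    """
    return recon_log_prob - beta * kl_to_unit_gaussian(mu, log_var)


print(kl_to_unit_gaussian([0.0, 0.0], [0.0, 0.0]))                # 0.0 (posterior equals prior)
print(beta_vae_objective(-3.0, mu=[1.0], log_var=[0.0], beta=2.0))  # -3.0 - 2.0 * 0.5 = -4.0
```

A larger value of `beta` penalizes encodings that deviate from the prior more strongly, trading reconstruction quality for a more regular latent space.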

This generative model converts high-dimensional observations [math]\displaystyle{ x }[/math], like images, into low-dimensional latent variables [math]\displaystyle{ z }[/math], and vice versa. The model is trained so that the latent variables capture the underlying factors of variation in an image. A current image [math]\displaystyle{ x }[/math] and goal image [math]\displaystyle{ x_g }[/math] can be converted into latent variables [math]\displaystyle{ z }[/math] and [math]\displaystyle{ z_g }[/math], respectively. These latent variables can then be used to represent the state and goal for the reinforcement learning algorithm. Learning Q functions and policies on top of this low-dimensional latent space rather than directly on images results in faster learning.

The agent encodes the current image ([math]\displaystyle{ x }[/math]) and goal image ([math]\displaystyle{ x_g }[/math]) into a latent space and uses distances in that latent space for reward. [9]

Using the latent variable representations for the images and goals also solves the problem of computing rewards. Instead of using pixel-wise error as our reward, the distance in the latent space is used as the reward to train the agent to reach a goal. The paper shows that this corresponds to rewarding reaching states that maximize the probability of the latent goal [math]\displaystyle{ z_g }[/math].

This generative model is also important because it allows an agent to easily generate goals in the latent space. In particular, the authors design the generative model so that latent variables are sampled from the VAE prior. This sampling mechanism is used for two reasons. First, it provides a mechanism for an agent to set its own goals: the agent simply samples a value for the latent variable from the generative model and tries to reach that latent goal. Second, this resampling mechanism is also used to relabel goals as mentioned above. Since the VAE is trained on real images, meaningful latent goals can be sampled from the latent variable prior. This will help the agent set its own goals and practice towards them if no goal is provided at test time.

Even without a human providing a goal, our agent can still generate its own goals, both for exploration and for goal relabeling. [9]

The authors summarize the purpose of the latent variable representation of images as follows: it (1) captures the underlying factors of a scene, (2) provides meaningful distances to optimize, and (3) provides an efficient goal sampling mechanism which can be used by the agent to generate its own goals. The overall method is called reinforcement learning with imagined goals (RIG) by the authors. The process starts with collecting data through a simple exploration policy. Alternative exploration strategies could be employed here, including off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, a VAE latent variable model is trained on state observations and fine-tuned during training. The latent variable model is used for multiple purposes: sampling a latent goal [math]\displaystyle{ z_g }[/math] from the model and conditioning the policy on this goal. All states and goals are embedded using the model’s encoder and then used to train the goal-conditioned value function. The authors then resample goals from the prior and compute rewards in the latent space.

Goal-Conditioned Policies with Unsupervised Representation Learning

A suitable goal representation is required to devise practical goal-conditioned value functions. In the absence of domain-specific knowledge and instrumentation, one choice is to set the goal space G to be the same as the state observation space S. However, when the state is high-dimensional, learning a goal-conditioned Q-function and policy becomes exceedingly difficult. One challenging problem with end-to-end approaches for visual RL tasks is that the resulting policy needs to learn both perception and control. Training the goal-conditioned value function also requires defining a goal-conditioned reward.

Their method jointly addresses a number of problems that arise when working with high-dimensional inputs such as images: sample efficient learning, reward specification, and automated goal-setting. These problems are addressed by learning a latent embedding using a [math]\displaystyle{ \beta }[/math]-VAE. This latent space is then used to represent the goal and state and retroactively relabel data with latent goals sampled from the VAE prior to improve sample efficiency.

Algorithm

Algorithm 1 is called reinforcement learning with imagined goals (RIG). The data is first collected via a simple exploration policy. The proposed model allows for alternate exploration policies to be used which include off-the-shelf exploration bonuses or unsupervised reinforcement learning methods. Then, the authors train a VAE latent variable model on state observations and finetune it over the course of training. VAE latent space modeling is used to allow the conditioning of policy on the goal which is sampled from the latent model. The VAE model is also used to encode all the goals and the states. When the goal-conditioned value function is trained, the authors resample prior goals and compute rewards in the latent space using the equation

[math]\displaystyle{ r(s, g) = - || z - z_g ||_A \propto \sqrt{\log e_{\Phi}(z_g | s)} }[/math]

This equation follows from the choice to use the negative Mahalanobis distance between the encoded state and the encoded goal in the latent space as the reward:

[math]\displaystyle{ r(s, g) = - || e(s) - e(g) ||_A = - || z - z_g ||_A }[/math]
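
The latent-distance reward can be sketched as below, taking the encoded state `z` and encoded goal `z_g` as already computed. The norm is interpreted as the Mahalanobis form, with the square root of the quadratic form in the matrix `A`; passing no matrix falls back to the identity, i.e. Euclidean distance. The function name and the example vectors are hypothetical.

```python
import math


def latent_reward(z, z_g, A=None):
    """Negative Mahalanobis distance -sqrt((z - z_g)^T A (z - z_g)).

    `z` and `z_g` are latent vectors (lists of floats); `A` is a square matrix
    given as a list of rows. With A=None the identity is used (Euclidean distance).
    """
    diff = [a - b for a, b in zip(z, z_g)]
    if A is None:
        return -math.sqrt(sum(d * d for d in diff))
    n = len(diff)
    quad = sum(diff[i] * A[i][j] * diff[j] for i in range(n) for j in range(n))
    return -math.sqrt(quad)


print(latent_reward([3.0, 4.0], [0.0, 0.0]))                              # -5.0
print(latent_reward([1.0, 0.0], [0.0, 0.0], A=[[4.0, 0.0], [0.0, 1.0]]))  # -2.0
```

The reward is zero only when the encoded state coincides with the encoded goal, and becomes more negative as the two move apart in latent space.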

Experiments

The authors evaluated their method against some prior algorithms and ablated versions of their approach on a suite of simulated and real-world tasks: Visual Reacher, Visual Pusher, and Visual Multi-Object Pusher. They compared their model with the following prior works: L&R, DSAE, HER, and Oracle. It is concluded that their approach substantially outperforms the previous methods and is close to the state-based "oracle" method in terms of efficiency and performance.

The figure below shows the performance of different algorithms on this task. This involved a simulated environment with a Sawyer arm. The authors' algorithm was given only visual input, and the available controls were end-effector velocity. The plots show the distance to the goal state as a function of simulation steps. The Oracle, as a baseline, was given true object location information, as opposed to visual pixel information.


They then investigated the effectiveness of distances in the VAE latent space for the Visual Pusher task. They observed that latent distance significantly outperforms the log probability and pixel mean-squared error. The resampling strategies are also varied while fixing other components of the algorithm to study the effect of relabeling strategy. In this experiment, the RIG, which is an equal mixture of the VAE and Future sampling strategies, performs best. Subsequently, learning with variable numbers of objects was studied by evaluating on a task where the environment, based on the Visual Multi-Object Pusher, randomly contains zero, one, or two objects during testing. The results show that their model can tackle this task successfully.
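
The equal mixture of relabeling strategies described above can be sketched as follows: with probability 0.5 the replay goal is drawn from the latent prior (the "VAE" strategy), otherwise it is the latent encoding of a later state from the same trajectory (the "Future" strategy). The function name, the fallback to the prior at the end of a trajectory, and the unit-Gaussian sampler are assumptions made for this illustration.

```python
import random


def choose_relabel_goal(trajectory_latents, t, sample_prior, rng, p_prior=0.5):
    """Pick a replay goal for the transition at time step `t`.

    With probability `p_prior`, sample a latent goal from the prior; otherwise
    return the latent of a uniformly chosen future step of the same trajectory.
    Falls back to the prior when step `t` has no future steps (an assumption).
    """
    if rng.random() < p_prior or t + 1 >= len(trajectory_latents):
        return sample_prior(rng)
    future_index = rng.randrange(t + 1, len(trajectory_latents))
    return trajectory_latents[future_index]


def unit_gaussian_prior(rng, dim=2):
    """Hypothetical sampler for a unit-Gaussian latent prior of dimension `dim`."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


rng = random.Random(0)
latents = [[0.0, 0.0], [0.5, 0.1], [1.0, 0.3]]  # hypothetical encoded trajectory
goal = choose_relabel_goal(latents, t=0, sample_prior=unit_gaussian_prior, rng=rng)
```

Setting `p_prior` to 1.0 or 0.0 recovers the pure VAE or pure Future strategy, which is one way the ablation over relabeling strategies could be reproduced.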

Finally, the authors tested RIG on a real-world robot for its ability to reach user-specified positions and push objects to desired locations, as indicated by a goal image. The robot is trained with access only to 84x84 RGB images and without access to joint angles or object positions. The robot first learns by setting its own goals in the latent space and autonomously practices reaching different positions without human involvement. After a reasonable amount of training time, the robot is given a goal image. Because the robot has practiced reaching so many goals, it is able to reach this goal without additional training:

(Left) The robot setup is pictured. (Right) Test rollouts of the learned policy.

The reaching task needs only 10,000 samples and an hour of real-world interaction.

They also used RIG to train a policy to push objects to target locations:

The robot pushing setup is pictured, with frames from test rollouts of the learned policy.

The pushing task is more complicated, and the method requires about 25,000 samples. Because the authors do not have access to the true object position during training, they report test episode returns computed with the VAE latent distance reward. As learning proceeds, RIG makes steady progress at optimizing the latent distance.

Conclusion & Future Work

In this paper, a new RL algorithm is proposed to efficiently solve goal-conditioned, vision-based tasks without any ground truth state information or reward functions. The authors suggest that, instead of goal images, one could use other representations, such as language and demonstrations, to specify goals. Also, while the paper provides a mechanism to sample goals for autonomous exploration, one can combine the proposed method with existing work by choosing these goals in a more principled way, i.e. with a procedure that is not only goal-oriented but also information-seeking or uncertainty-aware, to perform even better exploration. Furthermore, combining the idea of this paper with methods from multitask learning and meta-learning is a promising path toward general-purpose agents that can continuously and efficiently acquire skills. Lastly, there are a variety of robot tasks whose state representation would be difficult to capture with sensors, such as manipulating deformable objects or handling scenes with a variable number of objects. It would be interesting to see whether RIG can be scaled up to solve these tasks. A subsequent paper [10] builds on the framework of goal-conditioned reinforcement learning to extract state representations based on the actions required to reach them; the approach is abbreviated ARC, for actionable representations for control.

Critique

1. This paper is novel because it uses visual data and trains in an unsupervised fashion. The algorithm has no access to a ground truth state or to a pre-defined reward function. It can perform well in a real-world environment with no explicit programming.

2. From the videos, one major concern is that the robotic arm's position is not stable during training and test time. It is likely that the encoder compresses the image features so much that the images reconstructed from the latent space are too blurry to be used as goal images. This should be investigated in future work. It would also be worth investigating a method that draws on multiple data sources, with the agent trained to choose the source that has more complete information.

3. The algorithm seems to perform better when there is only one object in the images. For example, in the Visual Multi-Object Pusher experiment, the relative positions of the two pucks do not correspond well with their relative positions in the goal images. The same situation is also observed in the variable-object experiment. We may guess that the more information an image contains, the less likely the robot is to perform well. This limits the applicability of the current algorithm to real-world problems.

4. The instability mentioned in #2 is even more apparent in the multi-object scenario and appears to result from the model attempting to optimize the positions of both objects at the same time. Reducing the problem to a sequence of single-object targets may reduce the amount of time the robot spends moving between the multiple objects in the scene (which it currently does quite frequently).

References

1. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

2. Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to Poke by Poking: Experiential Learning of Intuitive Physics. In Advances in Neural Information Processing Systems (NIPS), 2016.

3. Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-Shot Visual Imitation. In International Conference on Learning Representations (ICLR), 2018.

4. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

5. Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning (ICML), 2017.

6. Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal Planning Networks. In International Conference on Machine Learning (ICML), 2018.

7. Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888, 2017.

8. Alex Lee, Sergey Levine, and Pieter Abbeel. Learning Visual Servoing with Deep Features and Fitted Q-Iteration. In International Conference on Learning Representations (ICLR), 2017.

9. Online source: https://bair.berkeley.edu/blog/2018/09/06/rig/

10. https://arxiv.org/pdf/1811.07819.pdf

11. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research (JMLR), 17(1):1334–1373, 2016. ISSN 15337928.

12. Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric Actor Critic for Image-Based Robot Learning. arXiv preprint arXiv:1710.06542, 2017.

13. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

14. Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay. In Advances in Neural Information Processing Systems (NIPS), 2017.

15. L P Kaelbling. Learning to achieve goals. In IJCAI-93: Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, volume 2, pages 1094–1098, 1993.