http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Amanjhunjhunwala&feedformat=atomstatwiki - User contributions [US]2023-01-27T18:35:47ZUser contributionsMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Exploration_via_Bootstrapped_DQN&diff=31262Deep Exploration via Bootstrapped DQN2017-11-23T17:05:06Z<p>Amanjhunjhunwala: /* Critique */</p>
<hr />
<div>== Details ==<br />
<br />
'''Title''': Deep Exploration via Bootstrapped DQN<br />
<br />
'''Authors''': Ian Osband {1,2}, Charles Blundell {2}, Alexander Pritzel {2}, Benjamin Van Roy {1}<br />
<br />
'''Organisations''':<br />
# Stanford University<br />
# Google Deepmind<br />
<br />
'''Conference''': NIPS 2016<br />
<br />
'''URL''': [https://papers.nips.cc/paper/6501-deep-exploration-via-bootstrapped-dqn papers.nips.cc]<br />
<br />
'''Online code sources'''<br />
* [https://github.com/iassael/torch-bootstrapped-dqn github.com/iassael/torch-bootstrapped-dqn]<br />
<br />
This summary contains background knowledge from Section 2-7 (except Section 5). Feel free to skip if you already know.<br />
<br />
== Intro to Reinforcement Learning ==<br />
<br />
In reinforcement learning, an agent interacts with an environment with the goal to maximize its long-term reward. A common application of reinforcement learning is to the [https://en.wikipedia.org/wiki/Multi-armed_bandit multi armed bandit problem]. In a multi armed bandit problem, there is a gambler and there are $n$ slot machines, and the gambler can choose to play any specific slot machine at any time. All the slot machines have their own probability distributions by which they churn out rewards, but this is unknown to the gambler. So the question is, how can the gambler learn the strategy to get the maximum long term reward?<br />
<br />
There are two things the gambler can do at any instance: either he can try a new slot machine, or he can play the slot machine he has tried before (and he knows he will get some reward). However, even though trying a new slot machine feels like it would bring less reward to the gambler, it is possible that the gambler finds out a new slot machine that gives a better reward than the current best slot machine. This is the dilemma of '''exploration vs exploitation'''. Trying out a new slot machine is '''exploration''', while redoing the best move so far is '''exploiting''' the currently understood perception of the reward.<br />
<br />
[[File:multiarmedbandit.jpg|thumb|Source: [https://blogs.mathworks.com/images/loren/2016/multiarmedbandit.jpg blogs.mathworks.com]]]<br />
<br />
There are many strategies to approach this '''exploration-exploitation dilemma'''. Some [https://web.stanford.edu/class/msande338/lec9.pdf common strategies] for optimizing in an exploration-exploitation setting are Random Walk, Curiosity-Driven Exploration, and Thompson Sampling. Random walk is a process that describes a path that consists of a succession of random steps on some mathematical space such as the integers. An elementary example of a random walk is the random walk on the integer number line, which starts at 0 and at each step moves +1 or −1 with equal probability.<br />
<br />
A lot of these approaches are provably efficient but assume that the state space is not very large. For instance, the approach called Curiosity-Driven Exploration aims to take actions that lead to immediate additional information. This requires the model to search “every possible cell in the grid” which is not desirable if state space is very large. Strategies for large state spaces often just either ignore exploration or do something naive like $\epsilon$-greedy for trading off exploration and exploitation, where you exploit with $1-\epsilon$ probability and explore "randomly" in rest of the cases. A detailed introduction and summary of curiosity driven exploration can be found in [17]. The general idea to tackle large or continuous state spaces is by value function approximation. An empirically tested strategy is Value Function Approximation using Fourier Basis [16]. It has also proven to perform well compared to radial basis functions and the polynomial basis, which are the two most popular fixed bases for linear value function approximation. <br />
<br />
This paper presents a new strategy for exploring deep reinforcement learning with discrete actions. In particular, the presented approach uses bootstrapped networks to approximate the posterior distribution of the Q-function. The bootstrapped neural network is comprised of numerous networks that have a shared layer for feature learning, but separate output layers - hence, each network learns a slightly different dataset thereby learning different Q-functions. In addition, the authors also showed that Thompson sampling can work with bootstrapped DQN reinforcement learning algorithm. For validation, the authors tested the proposed algorithm on various Atari benchmark gaming suites. This paper tries to use a Thompson sampling like approach to make decisions.<br />
<br />
== Thompson Sampling<sup>[[#References|[1]]]</sup> ==<br />
<br />
In Thompson sampling, our goal is to reach a belief that resembles the truth. Let's consider a case of coin tosses (2-armed bandit). Suppose we want to be able to reach a satisfactory pdf for $\mathbb{P}_h$ (heads). Assuming that this is a Bernoulli bandit problem, i.e. the rewards are $0$ or $1$, we can start off with $\mathbb{P}_h^{(0)}=\beta(1,1)$. The $\beta(x,y)$ distribution is a very good choice for a possible pdf because it works well for Bernoulli rewards. Further $\beta(1,1)$ is the uniform distribution $\mathbb{N}(0,1)$.<br />
<br />
Now, at every iteration $t$, we observe the reward $R^{(t)}$ and try to make our belief close to the truth by doing a Bayesian computation. Assuming $p$ is the probability of getting a heads,<br />
<br />
$$<br />
\begin{align*}<br />
\mathbb{P}(R|D) &\propto \mathbb{P}(D|R) \cdot \mathbb{P}(R) \\<br />
\mathbb{P}_h^{(t+1)}&\propto \mbox{likelihood}\cdot\mbox{prior} \\<br />
&\propto p^{R^{(t)}}(1-p)^{R^{(t)}} \cdot \mathbb{P}_h^{(t)} \\<br />
&\propto p^{R^{(t)}}(1-p)^{R^{(t)}} \cdot \beta(x_t, y_t) \\<br />
&\propto p^{R^{(t)}}(1-p)^{R^{(t)}} \cdot p^{x_t-1}(1-p)^{y_t-1} \\<br />
&\propto p^{x_t+R^{(t)}-1}(1-p)^{y_t+R^{(t)}-1} \\<br />
&\propto \beta(x_t+R^{(t)}, y_t+R^{(t)})<br />
\end{align*}<br />
$$<br />
<br />
[[File:thompson sampling coin example.png|thumb||||600px|Source: [https://www.quora.com/What-is-Thompson-sampling-in-laymans-terms Quora]]]<br />
<br />
This means that with successive sampling, our belief can become better at approximating the truth. There are similar update rules if we use a non-Bernoulli setting, say, Gaussian. In the Gaussian case, we start with $\mathbb{P}_h^{(0)}=\mathbb{N}(0,1)$ and given that $\mathbb{P}_h^{(t)}\propto\mathbb{N}(\mu, \sigma)$ it is possible to show that the update rule looks like<br />
<br />
$$<br />
\mathbb{P}_h^{(t+1)} \propto \mathbb{N}\bigg(\frac{t\mu+R^{(t)}}{t+1},\frac{\sigma}{\sigma+1}\bigg)<br />
$$<br />
<br />
=== How can we use this in reinforcement learning? ===<br />
<br />
We can use this idea to decide when to explore and when to exploit. We start with an initial belief, choose an action, observe the reward and based on the kind of reward, we update our belief about what action to choose next.<br />
<br />
== Bootstrapping <sup>[[#References|[2,3]]]</sup> ==<br />
<br />
This idea may be unfamiliar to some people, so I thought it would be a good idea to include this. In statistics, bootstrapping is a method to generate new samples from a given sample. Suppose that we have a given population, and we want to study a property $\theta$ of the population. So, we just find $n$ sample points (sample $\{D_i\}_{i=1}^n$), calculate the estimator of the property, $\hat{\theta}$, for these $n$ points, and make our inference. <br />
<br />
If we later wish to find some property related to the estimator $\hat{\theta}$ itself, e.g. we want a bound of $\hat{\theta}$ such that $\delta_1 \leq \hat{\theta} \leq \delta_2$ with a confidence of $c=0.95$, then we can use bootstrapping for this.<br />
<br />
Using bootstrapping, we can create a new sample $\{D'_i\}_{i=1}^{n'}$ by '''randomly sampling $n'$ times from $D$, with replacement'''. So, if $D=\{1,2,3,4\}$, a $D'$ of size $n'=10$ could be $\{1,4,4,3,2,2,2,1,3,4\}$. We do this a sufficient $k$ number of times, calculate $\hat{\theta}$ each time, and thus get a distribution $\{\hat{\theta}_i\}_{i=1}^k$. Now, we can choose the $100\cdot c$<sup>th</sup> and $100\cdot(1-c)$<sup>th</sup> percentile of this distribution, (let them be $\hat{\theta}_\alpha$ and $\hat{\theta}_\beta$ respectively) and say<br />
<br />
$$\hat{\theta}_\alpha \leq \hat{\theta} \leq \hat{\theta}_\beta, \mbox{with confidence }c$$<br />
<br />
== Why choose bootstrap and not dropout? ==<br />
<br />
There is previous work<sup>[[#References|[4]]]</sup> that establishes dropout as a good way to train NNs on a posterior such that the trained NN works like a function approximator that is close to the actual posterior. But, there are several problems with the predictions of this trained NN. The figures below are from the appendix of this paper. The left image is the NN trained by the authors of this paper on a sample noisy distribution and the right image is from the accompanying web demo from [[#References|[4]]], where the authors of [[#References|[4]]] show that their NN converges around the mean with a good confidence.<br />
<br />
[[File:dropout_results.png|thumb||center||700px|Source: this paper's appendix]]<br />
<br />
According to the authors of this paper,<br />
# Even though [[#References|[4]]] says that dropout converges arond the mean, their experiment actually behaves weirdly around a reasonable point like $x=0.75$. They think that this happens because dropout only affects the region local to the original data.<br />
# Samples from the NN trained on the original data do not look like a reasonable posterior (very spiky).<br />
# The trained NN collapses to zero uncertainty at the data points from the original data.<br />
<br />
Another reason I can think of not using dropout is that dropout increases the training time, even though it prevents overfitting.<br />
<br />
== Q Learning and Deep Q Networks (DQN) <sup>[[#References|[5]]]</sup> ==<br />
<br />
At any point in time, our rewards dictate what our actions should be. Also, in general, we want good long term rewards. For example, if we are playing a first-person shooter game, it is a good idea to go out of cover to kill an enemy, even if some health is lost. Similarly, in reinforcement learning, we want to maximize our long term reward. So if at each time $t$, the reward is $r_t$, then a naive way is to say we want to maximise<br />
<br />
$$<br />
R_t = \sum_{i=0}^{\infty}r_t<br />
$$<br />
<br />
But, this reward is unbounded. So technically it could tend to $\infty$ in a lot of the cases. This is why we use a '''discounted reward'''.<br />
<br />
$$<br />
R_t = \sum_{i=0}^{\infty}\gamma^t r_t<br />
$$<br />
<br />
Here, we take $0\leq \gamma \lt 1$. If it is equal to one, the agent values future reward just as much as the current reward. Conversely, a value of zero will cause the agent to only value immediate rewards, which only works with very detailed reward functions. So, what this means is that we value our current reward the most ($r_0$ has a coefficient of $1$), but we also consider the future possible rewards. So if we had two choices: get $+4$ now and $0$ at all other timesteps, or get $-2$ now and $+2$ after $3$ timesteps for $20$ timesteps, we choose the latter ($\gamma=0.9$). This is because $(+4) < (-2)+0.9^3(2+0.9\cdot2+\cdots+0.9^{19}\cdot2)$.<br />
<br />
<br />
A '''policy''' $\pi: \mathbb{S} \rightarrow \mathbb{A}$ is just a function that tells us what action to take in a given state $s\in \mathbb{S}$. Our goal is to find the best policy $\pi^*$ that maximises the reward from a given state $s$. So, a '''value function''' is defined from $s$ (which the agent is in, at timestep $t$) and following the policy $\pi$ as $V^\pi(s) = \mathbb{E}[R_t]$. The optimal value function is then simply<br />
<br />
$$<br />
V^*(s)=\displaystyle\max_{\pi}V^\pi(s)<br />
$$<br />
<br />
For convenience however, it is better to work with the '''Q function''' $Q: \mathbb{S}\times\mathbb{A} \rightarrow \mathbb{R}$. $Q$ is defined similarly as $V$. It is the expected return after taking an action $a$ in the given state $s$. So, $Q^\pi(s,a)=\mathbb{E}[R_t|s,a]$. The optimal $Q$ function is<br />
<br />
$$<br />
Q^*(s,a)=\displaystyle\max_{\pi}Q^\pi(s,a)<br />
$$<br />
<br />
Suppose that we know $Q^*$. Then, if we know that we are supposed to start at $s$ and take an action $a$ right now, what is the best course of action from the next time step? We just choose the optimal action $a'$ at the next state $s'$ that we reach. The optimal action $a'$ at state $s'$ is simply the argument $a_x$ that maximises our $Q^*(s',\cdot)$.<br />
<br />
$$<br />
a'=\displaystyle\arg\max_{a_x} Q^*(s',a_x)<br />
$$<br />
<br />
So, our best expected reward from $s$ taking action $a$ is $\mathbb{E}[r_t+\gamma\mathbb{E}[R_{t+1}]]$. This is known as the '''Bellman equation''' in optimal control problem (By the way, its continuous form is called '''Hamilton-Jacobi-Bellman equation''' or HJB equation, which is a very important partial differential equation):<br />
<br />
$$<br />
Q^*(s,a)=\mathbb{E}[r_t+\gamma \displaystyle\max_{a_x} Q^*(s',a_x)]<br />
$$<br />
<br />
In Q learning, we use a deep neural network with weights $\theta$ as a function approximator for $Q^*$, since Bellman equation is indeed a non-linear PDE and very difficult to solve numerically. The '''naive way''' to do this is to design a deep neural network that takes as input the state $s$ and action $a$, and produces an approximation to $Q^*$. <br />
<br />
* Suppose our neural net weights are $\theta_i$ at iteration $i$.<br />
* We want to train our neural net on the case when we are at $s$, take action $a$, get reward $r$, and reach $s'$.<br />
* To find out what action is best from $s'$, i.e. $a'$, we have to simulate all actions from $s'$. We can do this after we complete this iteration, then run $s',a_x$ for all $a_x\in\mathbb{A}$. But, we don't know how to complete this iteration without knowing this $a'$. So, another way is to simulate all actions from $s'$ using last known set of weights $\theta_{i-1}$. We just simulate state $s'$, action $a_x$ for all $a_x\in\mathbb{A}$ from the previous state and get $Q^*(s',a_x;\theta_{i-1})$. ('''Note''' that some papers do not use the set of weights from the previous iteration $\theta_{i-1}$. Instead they fix the weights for finding the best action for every $\tau$ steps to $\theta^-$, and do $Q^*(s',a_x;\theta^-)$ for $a_x\in\mathbb{A}$ and use this for the target value.)<br />
* Now we can compute our loss function using the Bellman equation, and backpropagate.<br />
$$<br />
\mbox{loss}=\mbox{target}-\mbox{prediction}=(r+\displaystyle\max_{a_x}Q^*(s',a_x;\theta_{i-1}))-Q^*(s,a;\theta_i)<br />
$$<br />
<br />
The '''problem''' with this approach is that at every iteration $i$, we have to do $|\mathbb{A}|$ forward passes on the previous set of weights $\theta_{i-1}$ to find out the best action $a'$ at $s'$. This becomes infeasible quickly with more possible actions.<br />
<br />
Authors of [[#References|[5]]] therefore use another kind of architecture. This architecture takes as input the state $s$, and computes the values $Q^*(s,a_x)$ for $a_x\in\mathbb{A}$. So there are $|\mathbb{A}|$ outputs. This basically parallelizes the forward passes so that $r+\displaystyle\max_{a_x}Q^*(s',a_x;\theta_{i-1})$ can be done with just a single pass through the outputs. The following figure illustrates this fact:<br />
<br />
[[File:hamid.png|thumb|500px|Source: David Silver slides|center]]<br />
<br />
<br />
[[File:DQN_arch.png|thumb||||600px|Source: [https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_folder_7/DQNBreakoutBlocks.png leonardoaraujosantos.gitbooks.io]]]<br />
<br />
'''Note:''' When I say state $s$ as an input, I mean some representation of $s$. Since the environment is a partially observable MDP, it is hard to know $s$. So, we can for example, apply a CNN on the frames and get an idea of what the current state is. We pass this output to the input of the DNN (DNN is the fully connected layer for the CNN then).<br />
<br />
=== Experience Replay ===<br />
<br />
Authors of this paper borrow the concept of experience replay from [[#References|[5,6]]]. In experience replay, we do training in episodes. In each episode, we play and store consecutive $(s,a,r,s')$ tuples in the experience replay buffer. Then after the play, we choose random samples from this buffer and do our training.<br />
<br />
<br />
Advantages of experience replay over simple online Q learning<sup>[[#References|[5]]]</sup>:<br />
* '''Better data efficiency''': It is better to use one transition many times to learn again and again, rather than just learn once from it.<br />
* Learning from consecutive samples is difficult because of correlated data. Experience replay breaks this correlation.<br />
* Online learning means the input is decided by the previous action. So, if the maximising action is to go left in some game, next inputs would be about what happens when we go left. This can cause the optimiser to get stuck in a feedback loop, or even diverge, as [[#Reference|[7]]] points out.<br />
<br />
== Double Q Learning ==<br />
<br />
=== Problem with Q Learning<sup>[[#References|[8]]]</sup> ===<br />
<br />
For a simple neural network, each update tries to shift the current $Q^*$ estimate to a new value:<br />
<br />
$$<br />
Q^*(s,a) \leftarrow Q^*(s,a) + \alpha(r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x) - Q^*(s,a))<br />
$$<br />
<br />
Here $\alpha$ is the scalar learning rate. Suppose the neural net has some inherent noise $\epsilon$. So, the neural net actually stores a value $\mathbb{Q}^*$ given by<br />
<br />
$$<br />
\mathbb{Q}^* = Q^*+\epsilon<br />
$$<br />
<br />
Even if $\epsilon$ has zero mean in the beginning, using the $\max$ operator at the update steps will start propagating $\gamma\cdot\max \mathbb{Q}^*$. This leads to a non zero mean subsequently. The problem is that "max causes overestimation because it does not preserve the zero-mean property of the errors of its operands." ([[#References|[8]]]) Thus, Q learning is more likely to choose overoptimistic values.<br />
<br />
=== How does Double Q Learning work? <sup>[[#References|[9]]]</sup> ===<br />
<br />
The problem can be solved by using two sets of weights $\theta$ and $\Theta$. The $\mbox{target}$ can be broken up as<br />
<br />
$$<br />
\mbox{target} = r+\displaystyle\max_{a_x}Q^*(s',a_x;\theta) = r+Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\theta) = r+Q^*(s',a';\theta)<br />
$$<br />
<br />
Using double Q learning, we '''select''' the best action using current weights $\theta$ and '''evaluate''' the $Q^*$ value to decide the target value using $\Theta$.<br />
<br />
$$<br />
\mbox{target} = r+Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\Theta) = r+Q^*(s',a';\Theta)<br />
$$<br />
<br />
This makes the evaluation fairer.<br />
<br />
=== Double Deep Q Learning ===<br />
<br />
[[#References|[9]]] further talks about how to use this for deep learning without much additional overhead. The suggestion is to use $\theta^-$ as $\Theta$.<br />
<br />
$$<br />
\mbox{target} = r+Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\theta^-) = r+Q^*(s',a';\theta^-)<br />
$$<br />
=== Final DQN used in this paper ===<br />
The authors combine the idea of double DQN discussed above with the loss function discussed in "Q Learning and Deep Q Networks" section. So here is the final update for parameters of action value function:<br />
<br />
$$<br />
\theta_{t+1} \leftarrow \theta_t + \alpha(y_t^Q -Q(s_t,a_t;\theta_t))\nabla_{\theta}Q(s_t,a_t;\theta_t)<br />
$$<br />
$$<br />
y_t^Q \leftarrow r_t + \gamma Q(s_{t+1}, \underset{a}{argmax} \ Q(s_{t+1},a;\theta_t);\theta^{-})<br />
$$<br />
<br />
== Bootstrapped DQN ==<br />
<br />
The authors propose an architecture that has a shared network and $K$ bootstrap heads. So, suppose our experience buffer $E$ has $n$ data points, where each datapoint is a $(s,a,r,s')$ tuple. Each bootstrap head trains on a different buffer $E_i$, where each $E_i$ has been constructed by sampling $n$ data points from the original experience buffer $E$ with replacement ('''bootstrap method''').<br />
<br />
<br />
Since each of the heads train on a different buffer, they model a different $Q^*$ function (say $Q^*_k$). Now, for each episode, we first choose a specific $Q^*_k=Q^*_s$. This $Q^*_s$ helps us create the experience buffer for the episode. From any state $s_t$, we populate the experience buffer by choosing the next action $a_t$ that maximises $Q^*_s$. (similar to '''Thompson Sampling''')<br />
<br />
$$<br />
a_t = \displaystyle\arg\max_a Q^*_s(s_t,a_t)<br />
$$<br />
<br />
Also, along with $s_t,a_t,r_t,s_{t+1}$, they push a bootstrap mask $m_t$. This mask is basically is a binary vector of size $K$, and it tells which $Q_k$ should be affected by this datapoint, if it is chosen as a training point. So, for example, if $K=5$ and there is a experience tuple $(s_t,a_t,r_t,s_{t+1},m_t)$ where $m_t=(0,1,1,0,1)$, then $(s_t,a_t,r_t,s_{t+1})$ should only affect $Q_2,Q_3$ and $Q_5$.<br />
<br />
<br />
So, at each iteration, we just choose few points from this buffer and train the respective $Q_{(\cdot)}$ based on the bootstrap masks.<br />
<br />
=== How to generate masks? ===<br />
<br />
Masks are created by sampling from the '''masking distribution'''. Now, there are many ways to choose this masking distribution:<br />
<br />
* If for each datapoint $D_i$ ($i=1$ to $n$), we mask from $\mbox{Bernoulli}(0.5)$, this will roughly allow us to have half the points from the original buffer. To get to size $n$, we duplicate these points by doubling the weights for each datapoint. This essentially gives us a '''double or nothing''' bootstrap<sup>[[#References|[10]]]</sup>.<br />
* If the mask is $(1, 1 \cdots 1)$, then this becomes an '''ensemble learning''' method.<br />
* $m_t~\mbox{Poi}(1)$ (poisson distribution)<br />
* $m_t[k]~\mbox{Exp}(1)$ (exponential distribution)<br />
<br />
For this paper's results, the authors used a $\mbox{Bernoulli}(p)$ distribution.<br />
<br />
== Related Work ==<br />
<br />
The authors mention the method described in [[#References|[11]]]. The authors of [[#References|[11]]] talk about the principle of "optimism in the face of uncertainty" and modify the reward function to encourage state-action pairs that have not been seen often:<br />
<br />
$$<br />
R(s,a) \leftarrow R(s,a)+\beta\cdot\mbox{novelty}(s,a)<br />
$$<br />
<br />
According to the authors, [[#References|[11]]]'s DQN algorithm relies on a lot of hand tuning and is only good for non stochastic problems. The authors further compare their results to [[#References|[11]]]'s results on Atari.<br />
<br />
<br />
The authors also mention an existing algorithm PSRL<sup>[[#References|[12,13]]]</sup>, or posterior sampling based RL. However, this algorithm requires a solved MDP, which is not feasible for large systems. Bootstrapped DQN approximates this idea by sampling from approximate $Q^*$ functions.<br />
<br />
<br />
Further, the authors mention that the work in [[#References|[12,13]]] has been followed by RLSVI<sup>[[#Reference|[14]]]</sup> which solves the problem for linear cases.<br />
<br />
== Deep Exploration: Why is Bootstrapped DQN so good at it? ==<br />
<br />
The authors consider a simple example to demonstrate the effectiveness of bootstrapped DQN at deep exploration.<br />
<br />
[[File:deep_exploration_example.png|thumb||center||700px|Source: this paper, section 5.1]]<br />
<br />
<br />
<br />
In this example, the agent starts at $s_2$. There are $N$ steps, and $N+9$ timesteps to generate the experience buffer. The agent is said to have learned the optimal policy if it achieves the best possible reward of $10$ (go to the rightmost state in $N-1$ timesteps, then stay there for $10$ timesteps), for at least $100$ such episodes. The results they got:<br />
<br />
[[File:deep_exploration_results.png|thumb||center||700px|Source: this paper, section 5.1]]<br />
<br />
<br />
<br />
The blue dots indicate when the agent learned the optimal policy. If this took more than $2000$ episodes, they indicate it with a red dot. Thompson DQN is DQN with posterior sampling at every timestep. Ensemble DQN is same as bootstrapped DQN except that the mask is all $(1,1 \cdots 1)$. It is evident from the graphs that bootstrapped DQN can achieve deep exploration better than these two methods and DQN.<br />
<br />
=== But why is it better? ===<br />
<br />
The authors say that this is because bootstrapped DQN constructs different approximations to the posterior $Q^*$ with the same initial data. This diversity of approximations is because of random initialization of weights for the $Q^*_k$ heads. This means that these heads start out trying random actions (because of diverse random initial $Q^*_k$), but when some head finds a good state and generalizes to it, some (but not all) of the heads will learn from it, because of the bootstrapping. Eventually, other heads will either find other good states or end up learning the best good states found by the other heads.<br />
<br />
<br />
So, the architecture explores well and once a head achieves the optimal policy, eventually, all heads achieve the policy.<br />
<br />
== Results ==<br />
<br />
The authors test their architecture on 49 Atari games. They mention that there has been recent work to improve the performance of DDQNs, but those are tweaks whose intentions are orthogonal to this paper's idea. So, they don't compare their results with them.<br />
<br />
=== Scale: What values of $K$, $p$ are best? ===<br />
<br />
[[File:scale_k_p.png|thumb||center||800px|Source: this paper, section 6.1]]<br />
<br />
Recall that $K$ is the number of bootstrap heads and $p$ is the parameter for the masking distribution (Bernoulli). The authors say that around $K=10$, the performance reaches close to the peak, so it should be good.<br />
<br />
<br />
$p$ also represents the amount of data sharing. This is because lesser $p$ means there is a lesser chance (due to the Bernoulli distribution) that the corresponding datapoint is taken into the bootstrapped dataset $D_i$. So, lesser $p$ means more identical datapoints, hence more heads share their datapoints. However, the value of $p$ doesn't seem to affect the rewards achieved over time. The authors give the following reasons for it:<br />
<br />
* The heads start with random weights for $Q^*$, so the targets (which use $Q^*$) turn out to be different. So the update rules are different.<br />
* Atari is deterministic.<br />
* Because of the initial diversity, the heads will learn differently even if they predict the same action for the given state.<br />
<br />
$p=1$ is the value they use finally because this reduces the no. of identical datapoints and reduces time.<br />
<br />
=== Performance on Atari ===<br />
<br />
In general, the results tell us that bootstrapped DQN achieves better results.<br />
<br />
[[File:atari_results_bootstrapped_dqn.png|thumb||center||800px|Source: this paper, section 6.2]]<br />
<br />
The authors plot the improvement they achieved with bootstrapped DQN with the games. They define '''improvement''' to be $x$ if bootstrapped DQN achieves a better result than DQN in $\frac{1}{x}$ frames.<br />
<br />
[[File:bdqn_improvement.png|thumb||center||1000px|Source: this paper, section 6.2]]<br />
<br />
<br />
The authors say that bootstrapped DQN doesn't work good on all Atari games. They point out that there are some challenging games, where exploration is key but bootstrapped DQN doesn't do good enough (but does better than DQN). Some of these games are Frostbite and Montezuma’s Revenge. They say that even better exploration may help, but also point out that there may be other problems such as network instability, reward clipping, and temporally extended rewards.<br />
<br />
=== Improvement: Highest Score Reached & how fast is this high score reached? ===<br />
<br />
The authors plot the improvement graphs after 20m and 200m frames.<br />
<br />
[[File:cumulative_rewards_bdqn.png|thumb||center||700px|Source: this paper, section 6.3]]<br />
<br />
=== Visualisation of Results ===<br />
<br />
One of the authors' [https://www.youtube.com/playlist?list=PLdy8eRAW78uLDPNo1jRv8jdTx7aup1ujM youtube playlist] can be found online.<br />
<br />
<br />
The authors also point out that just purely using bootstrapped DQN as an exploitative strategy is pretty good by itself, better than vanilla DQN. This is because of the deep exploration capabilities of bootstrapped DQN, since it can use the best states it knows and also plans to try out states it doesn't have any information about. Even in the videos, it can be seen that the heads agree at all the crucial decisions, but stay diverse at other less important steps.<br />
<br />
== Critique ==<br />
A reviewer pointed to the fact that "this paper is a bit hard to follow at times, and you have to go all the way to the appendix to get a good understanding of how the entire algorithm comes together and works. The step-by-step algorithm description could be more complete (there are steps of the training process left out, albeit they are not unique to Bootstrap DQN) and should not be hidden down in the appendix. This should probably be in Section 3. The MDP examples in Section 5 were not explained well; it feels like it doesn’t contribute too much to the overall impact of the paper."<br />
<br />
It would be very interesting and a great addition to the experimental section of the paper, if the authors would have compared with asynchronous methods of exploration of the state space first introduced in [[#References|[15]]]. The authors unfortunately only compared their DQN with the original DQN and not all the other variations in the literature and justified it by saying that their idea was "orthogonal" to these improvements.<br />
<br />
=== Different way to do exploration-exploitation? ===<br />
<br />
Instead of choosing the next action $a_t$ that maximises $Q^*_s$, they could have chosen different actions $a_i$ with probabilities<br />
<br />
$$<br />
\mathbb{P}(s_t,a_i) = \frac{Q^*_s(s_t,a_i)}{\displaystyle \sum_{i=1}^{|\mathbb{A}|} Q^*_s(s_t,a_i)}<br />
$$<br />
<br />
According to me, this is closer to Thompson Sampling.<br />
<br />
=== Why use Bernoulli? ===<br />
<br />
The choice of having a Bernoulli masking distribution eventually doesn't help them at all, since the algorithm does well because of the initial diversity. Maybe they can use some other masking distribution? However, the bootstrapping procedure is distribution-independent, the choice of masking distribution should not affect the long term performance of Bootstrapped DQN.<br />
<br />
=== Unanswered Questions & Miscellaneous ===<br />
* The Thompson DQN is not preferred because other randomized value functions can implement settings similar to Thompson sampling without the need for an intractable exact posterior update and also by working around the computational issue with Thompson Sampling: resampling every time step. Perhaps the authors could have explored Temporal Difference learning which is an attempt at combining Dynamic Programming and Monte Carlo methods.<br />
* The actual algorithm is hidden in the appendix. It could have been helpful if it were in the main paper.<br />
* The authors compare their results only with the vanilla DQN architecture and claim that the method taken in this paper is orthogonal to other improvements such as DDQN and the dueling network architecture. In my opinion, the fact that multiple but identical DQNs manage to solve the problem better than one is interesting, but it also implies that the single DQN still has inherent problems. It is not clear to me why the authors claim that their method is orthogonal to other methods, and actually it might help to solve some of the unstable behaviors of DQN.<br />
<br />
=== Improvement on Hierarchical Tasks ===<br />
In this paper, it is interesting that such bootstrap exploration principle actually improves the performance over hierarchical tasks such as Montezuma Revenge. It would be better if the authors could illustrate further about the influence of exploration in a sparse-reward hierarchical task.<br />
<br />
== References ==<br />
<br />
# [https://bandits.wikischolars.columbia.edu/file/view/Lecture+4.pdf Learning and optimization for sequential decision making, Columbia University, Lec 4]<br />
# [https://www.thoughtco.com/what-is-bootstrapping-in-statistics-3126172 Thoughtco, What is bootstrapping in statistics?]<br />
# [https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading24.pdf Bootstrap confidence intervals, Class 24, 18.05, MIT Open Courseware]<br />
# [https://arxiv.org/abs/1506.02142 Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142, 2015.]<br />
# [https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf Mnih et al., Playing Atari with Deep Reinforcement Learning, 2015]<br />
# Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.<br />
# John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. Automatic Control, IEEE Transactions on, 42(5):674–690, 1997.<br />
# S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning, 1993.<br />
# [https://arxiv.org/pdf/1509.06461.pdf Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning, 2015.]<br />
# [https://pdfs.semanticscholar.org/d623/c2cbf100d6963ba7dafe55158890d43c78b6.pdf Dean Eckles and Maurits Kaptein, Thompson Sampling with the Online Bootstrap, 2014, Pg 3]<br />
# [https://arxiv.org/abs/1507.00814 Bradly C. Stadie, Sergey Levine, Pieter Abbeel, Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models, 2015.]<br />
# Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling, NIPS 2013.<br />
# Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension, NIPS 2014.<br />
# [https://arxiv.org/abs/1402.0635 Ian Osband, Benjamin Van Roy, Zheng Wen, Generalization and Exploration via Randomized Value Functions, 2014.]<br />
# Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International Conference on Machine Learning. 2016.<br />
# George Konidaris, Sarah Osentoski, and Philip Thomas. 2011. Value function approximation in reinforcement learning using the fourier basis. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence (AAAI'11). AAAI Press 380-385.<br />
# P.Y. Oudeyer, F. Kaplan. What is Intrinsic Motivation? A Typology of Computational Approaches. Front Neurorobotics. 2007.<br />
<br />
Other helpful links (unsorted):<br />
* [http://pemami4911.github.io/paper-summaries/deep-rl/2016/08/16/Deep-exploration.html pemami4911.github.io]<br />
* [http://www.stat.yale.edu/~pollard/Courses/241.fall97/Poisson.pdf Poisson Approximations]<br />
<br />
== Appendix ==<br />
<br />
=== Algorithm for Bootstrapped DQN ===<br />
The appendix lists the following algorithm. Periodically, the replay buffer is played back to update value function network Q.<br />
<br />
[[File:alg1.PNG|thumb||left||700px|Source: this paper's appendix]]</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Alternative_Neural_Network:_Exploring_Contexts_As_Early_As_Possible_For_Action_Recognition&diff=31259Deep Alternative Neural Network: Exploring Contexts As Early As Possible For Action Recognition2017-11-23T16:59:54Z<p>Amanjhunjhunwala: /* Optic Flow */</p>
<hr />
<div>==Introduction==<br />
<br />
Action recognition deals with recognizing and classifying the actions or activities done by humans or other agents in a video clip. In action recognition, contexts contribute semantic clues for action recognition in video(See Fig below[8]). Conventional Neural Networks [1,2,3] and their shifted version 3D CNNs [4,5,6] have been employed in action recognition but they identify and aggregate the contexts at later stages. <br />
[[File:ActionRecognition1.jpg|center|400px|border|context and action region]]<br />
<br />
The authors have come up with a strategy to identify contexts in the videos as early as possible and leverage their evolutions for action recognition. Contexts contribute semantic clues for action recognition in videos. The networks themselves involve a lot of layers, with the first layer typically having a receptive field (RF) that outputs only extra local features. As we go deeper into the layers the Receptive Fields expand and we start getting the contexts. The authors identified that increasing the number of layers will only cause additional burden in terms of handling the parameters and contexts could be obtained even in the earlier stages. The authors also cite the papers [9,10] that relate the CNNs and the visual systems of our brain, one remarkable difference being the abundant recurrent connections in our brain compared to the forward connections in the CNNs. In summary, this paper proposes a novel neural network, called deep alternative neural network (DANN), which is a based method for action recognition. The novel component is called an "alternative layer" which is composed of a volumetric convolutional layer followed by a recurrent layer. In addition, the authors also propose a new approach to select network input based on optical flow. The validity of DANN is carried out on HMDB51 and UCF101 datasets and it is observed that the proposed method achieves comparable performance against state of the art methods.<br />
<br />
The main contributions in the paper can be summarized as follows: <br />
* A Deep Alternative Neural Network (DANN) is proposed for action recognition. <br />
* DANN consists of alternative volumetric convolutional and recurrent layers. <br />
* An adaptive method to determine the temporal size of the video clip <br />
* A volumetric pyramid pooling layer to resize the output before fully connected layers.<br />
<br />
===Related Work===<br />
There are already exists a very related paper ([11]) in the literature which proposed a similar alternation architecture. In particular, the similarity between the authors work and the aforementioned paper is that they both propose alternating CNN-RNN architectures. This similarity between the two works was noted by Reviewer 1 in the NIPS review process.<br />
<br />
=== Optic Flow ===<br />
Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene.<br />
It can be used for affordance perception, the ability to discern possibilities for action within the environment. The following image describes the optical flow :[[File:oflow.png | thumb | 250px]]<br />
<br />
==Deep Alternative Neural Network:==<br />
===Adaptive Network Input===<br />
The input size of the video clip is generally determined empirically and various approaches have been taken in the past with a different number of frames. For instance, many previous papers suggested to used shorter intervals of between 1 to 16 frames. However, more recent work[9] recognized that human-based actions often “span tens or hundreds of frames” and longer intervals such as 60 frames will outperform the one with a shorter interval. However, there’s still no systematic way of determining the number of frames for input size of the network. This serves the motives for the authors of this paper to develop this adaptive method. Past research shows that motion energy intensity induced by human activity exhibits a regular periodicity. This signal can be approximately estimated by optical flow computation as shown in Figure 1, and is particularly suitable to address our temporal estimation due to: <br />
* the local minima and maxima landmarks probably correspond to characteristic gesture and motion <br />
* it is relatively robust to changes in camera viewpoint.<br />
<br />
The authors have come up with an adaptive method to automatically select the most discriminative video fragments using the density of optical flow energy which exhibits regular periodicity. According to Wikipedia, optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene, and optical flow methods try to calculate the motion between two image frames which are taken at different times. The optimal flow energy of an optical field $(v_{x},v_{y})$ is defined as follows <br />
<br />
:<math>e(I)=\underset{(x,y)\in\mathbb{P}}{\operatorname{\Sigma}} ||v_{x}(x,y),v_{y}(x,y)||_{2}</math><br />
<br />
Here, P is the pixel level set of selected interest points. They locate the local minima and maxima landmarks $\{t\}$ of $\epsilon = \{e(I_1),\dots,e(I_t)\}$ and for each two consecutive landmarks create a video fragment $s$ by extracting the frames $s = \{I_{t-1},\dots,I_t\}$.<br />
<br />
[[File:golfswing.png]]<br />
<br />
To deal with the different length of video clip, we adopt the idea of spatial pyramid pooling (SPP) in [12] and extend to temporal domain, developing a volumetric pyramid pooling (VPP) layer to transfer video clip of arbitrary size into a universal length in the last alternative layer before fully connected layer.<br />
<br />
===Alternative Layer===<br />
This is a key layer consisting of a standard volumetric convolutional layer followed by a designed recurrent layer. Volumetric convolutional extracts features from local neighborhoods and a recurrent layer is applied to the output and it proceeds iteratively for T times. The input of a unit at position (x,y,z) in the jth feature map of the ith AL in time t, $u_{ij}^{xyz}(t)$, is given by,<br />
<br />
:<math>u_{ij}^{xyz}(t) = u_{ij}^{xyz}(0) + f(w_{ij}^{r}u_{ij}^{xyz}(t-1)) + b_{ij} \\ <br />
u_{ij}^{xyz}(0) = f(w_{i-1}^{c}u_{(i-1)j}^{xyz}) <br />
</math><br />
<br />
U(0): feed forward output of volumetric convolutional layer. <br />
U(t-1) : recurrent input of previous time <br />
$w_{k}^{c}$ and $w_{k}^{r}$: vectorized feed-forward kernels and recurrent kernels respectively <br />
f: ReLU function followed by a local response normalization (LRN), which mimics the lateral inhibition in the cortex where different features compete for<br />
large responses. <br />
<br />
Figure 3 depicts this structure:<br />
[[File:unfolded.PNG|1000px]]<br />
<br />
The recurrent connections in AL provide two advantages. First, they enable every unit to incorporate contexts in an arbitrarily large region in the current layer。 However, the drawback is that without top-down connections, the states of the units in the current layer cannot be influenced by the context seen by higher-level units; Second, the recurrent connections increase the network depth while keeping the number of adjustable parameters constant by weight sharing, because AL uses only extra constant parameters of a recurrent kernel size.<br />
<br />
===Volumetric Pyramid Pooling Layer===<br />
<br />
[[File:Volumetric Pyramid Pooling Layer.png|thumb|550px|Figure 2: Volumetric Pyramid Pooling Layer]]<br />
The authors have replaced the last pooling layer with a volumetric pyramid pooling layer (VPPL) as we need fixed-length vectors for the fully connected layers and the AL accepts video clips of arbitrary sizes and produces outputs of variable sizes. Figure 2 illustrates the structure of VPPL. The authors have used the max pooling to pool the responses of each kernel in each volumetric bin. The outputs are kM dimensional vectors where:<br />
<br />
M: number of bins <br />
<br />
K: Number of kernels in the last alternative layer.<br />
<br />
This layer structure allows not only for arbitrary-length videos, but also arbitrary aspect ratios and scales.<br />
<br />
It reminds me of the spatial pyramid pooling in deep convolutional networks. In CNN, the dimensions of the training data are the same, so that after convolution, we can train the classifiers effectively. To improve the limit of the same dimension, spatial pyramid pooling is introduced.<br />
<br />
==Overall Architecture== <br />
[[File:DANN Architecture.png|thumb|550px|Figure 3:DANN Architecture]]<br />
The following are the components of the DANN (as shown in Figure 3)<br />
* 6 Alternative layers with 64, 128, 256, 256, 512 and 512 kernel response maps <br />
* 5 ReLU and volumetric pooling layers <br />
* 1 volumetric pyramid pooling layer <br />
* 3 fully connected layers of size 2048 each <br />
* A softmax layer<br />
<br />
==Implementation details==<br />
The authors have used the Torch toolbox platform for implementations of volumetric convolutions, recurrent layers, and optimizations. They have used a technique called as random clipping for data augmentation, in which they select a point randomly from the input video of fixed size 80x80xt after determining the temporal size t. This technique is preferred to the common alternative of pre-processing data using a sliding window approach to have pre-segmented clips. The authors cite how using this technique limits the amount of data when the windows are not overlapped with one another. For training the network the authors have used SGD applied to mini-batches of size 30 with a negative log likelihood criterion. Training is done by minimizing the cross-entropy loss function using backpropagation through time algorithm (BPTT). During testing, they applied a video clip divided into 80x80xt clips with a stride of 4 frames followed by testing with 10 crops. The final score is the average of all clip-level scores and the crop scores.<br />
Data augmentation techniques such as the multi-scale cropping method have been evaluated due to the recent success in the state-of-the-art performance displayed by Very Deep Two-stream ConvNets. Going by intuition, the corner cropping strategy could provide better results ( based on trade-off degree) since the receptive fields can focus harder on the central regions of the video frames [7].<br />
<br />
==Evaluations==<br />
===Datasets:===<br />
* The datasets used in the evaluation are UCF101 and HMDB51 <br />
* UCF101 – 13K videos annotated into 101 classes <br />
* HMDB51 – 6.8K videos with 51 actions. <br />
* Three training and test splits are provided <br />
* Performance measured by mean classification accuracy across the splits. <br />
* UCF101 split – 9.5K videos; HMDB51 – 3.7K training videos.<br />
<br />
===Quantitative Results===<br />
The authors used three types of optical flows, viz., sparse, RGB and TVL1 and found that TVL1 is suitable as action recognition is more easy to learn from motion information compared to raw pixel values. The influence of data augmentation is also studied. The baseline being sliding window with 75% overlap, the authors observed that the random clipping and multi-scale clipping outperformed the baseline on the UCF101 split 1 dataset. The authors were able to prove that the adaptive temporal length was able to give a boost of 4.2% when compared with architectures that had fixed-size temporal length. Experiments were also conducted to see if the learnings done in one dataset could improve the accuracy of another dataset. Fine tuning HMDB51 from UCF101 boosted the performance from 56.4% to 62.5%. The authors also observed that increasing the AL layers improves the performance as larger contexts are being embedded into the DANN. The DANN achieved an overall accuracy of 65.9% and 91.6% on HMDB51 and UCF101 respectively.<br />
<br />
<br />
[[File:Performance Comparison of different input modalities.png]]<br />
<br />
===Qualitative Analysis===<br />
The authors have discussed the quality of the prediction in the video clips taking examples of two different scenes involving bowling and haircut. In the bowling scene, the adaptive temporal choice used by DANN could aggregate more reasonable semantic structures and hence it leveraged reasonable video clips as input. On the other hand, the performance on the haircut video clip was not up to the mark as the rich contexts provided by the DANN was not helpful in a setting with simple actions performed in a simple background.<br />
<br />
==Conclusions and Critique==<br />
* Deep alternative neural network is introduced for action recognition.<br />
* The key new component is an "alternative layer" which is composed of a convolutional layer followed by a recurrent layer. As the paper targets action recognition in video, the convolutional layer acts on a 3D spatio-temporal volume.<br />
* DANN consists of volumetric convolutional layer and a recurrent layer. <br />
* A preprocessing stage based on optical flow is used to select video fragments to feed to the neural network.<br />
* The authors have experimented with different datasets like HMDB51 and UCF101 with different scenarios and compared the * * performance of DANN with other approaches. <br />
* The spatial size is still chosen in an ad hoc manner and this can be an area of improvement. <br />
* There are prospects for studying action tube which is a more compact input.<br />
* The paper uses volumetric convolutional layer, but it doesn't say how it is better than recurrent neural networks in exploring temporal information.<br />
* There is no experimental evidence to compare the proposed method with long-term recurrent convolutional network. Also, there is no analysis of time complexity of the approach used.<br />
<br />
Github code: https://github.com/wangjinzhuo/DANN<br />
<br />
In the formal review of the paper [https://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html], some interesting criticisms of the paper are surfaced. For starters, one reviewer notes that a similar architecture was proposed in [https://arxiv.org/abs/1511.06432], limiting the novelty of the approach somewhat. The reviewers question the validity of the approach in even slightly more complicated settings (i.e. any non-static camera, which brings in the issue of optical flow). Other criticisms come from a lack of clear motivation for choices that the authors have made, for instance, the use of Local Response Normalization has fallen slightly out-of-favour, or the benefit of using a sliding window approach during testing (and random clips during training).<br />
<br />
Quantitatively, the benefits of the author's approach are not readily apparent. In comparisons with state-of-the-art, the proposed model performs worse on HMDB, and while they claim the highest performance on UCF, the increase is merely .1 over previous best efforts.<br />
<br />
==References==<br />
<br />
[1] Andrej Karpathy, George Toderici, Sachin Shetty, Tommy Leung, Rahul Sukthankar, and Li FeiFei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014 <br />
<br />
[2] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014. <br />
<br />
[3]Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deepconvolutional descriptors. In CVPR, pages 4305–4314, 2015. <br />
<br />
[4] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2013. <br />
<br />
[5] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497, 2015. <br />
<br />
[6]Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. arXiv preprint arXiv:1604.04494, 2016. <br />
<br />
[7]Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao. Towards Good Practices for Very Deep Two-Stream ConvNets. arXiv preprint arXiv:1507.02159 , 2015. <br />
<br />
[8] IEEE International Symposium on Multimedia 2013 <br />
<br />
[9] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action<br />
recognition. arXiv preprint arXiv:1604.04494, 2016<br />
<br />
[10] https://en.wikipedia.org/wiki/Optical_flow<br />
<br />
[11] Delving Deeper into Convolutional Networks for Learning Video Representations Nicolas Ballas, Li Yao, Chris Pal, Aaron Courville, ICLR 2016 <br />
<br />
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI, 37(9):1904–1916, 2015.<br />
<br />
[13] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime tv-l<br />
1 optical flow. In Pattern Recognition, pages 214–223. 2007.<br />
<br />
A list of expert reviews: http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Alternative_Neural_Network:_Exploring_Contexts_As_Early_As_Possible_For_Action_Recognition&diff=31258Deep Alternative Neural Network: Exploring Contexts As Early As Possible For Action Recognition2017-11-23T16:59:44Z<p>Amanjhunjhunwala: /* Optic Flow */</p>
<hr />
<div>==Introduction==<br />
<br />
Action recognition deals with recognizing and classifying the actions or activities done by humans or other agents in a video clip. In action recognition, contexts contribute semantic clues for action recognition in video(See Fig below[8]). Conventional Neural Networks [1,2,3] and their shifted version 3D CNNs [4,5,6] have been employed in action recognition but they identify and aggregate the contexts at later stages. <br />
[[File:ActionRecognition1.jpg|center|400px|border|context and action region]]<br />
<br />
The authors have come up with a strategy to identify contexts in the videos as early as possible and leverage their evolutions for action recognition. Contexts contribute semantic clues for action recognition in videos. The networks themselves involve a lot of layers, with the first layer typically having a receptive field (RF) that outputs only extra local features. As we go deeper into the layers the Receptive Fields expand and we start getting the contexts. The authors identified that increasing the number of layers will only cause additional burden in terms of handling the parameters and contexts could be obtained even in the earlier stages. The authors also cite the papers [9,10] that relate the CNNs and the visual systems of our brain, one remarkable difference being the abundant recurrent connections in our brain compared to the forward connections in the CNNs. In summary, this paper proposes a novel neural network, called deep alternative neural network (DANN), which is a based method for action recognition. The novel component is called an "alternative layer" which is composed of a volumetric convolutional layer followed by a recurrent layer. In addition, the authors also propose a new approach to select network input based on optical flow. The validity of DANN is carried out on HMDB51 and UCF101 datasets and it is observed that the proposed method achieves comparable performance against state of the art methods.<br />
<br />
The main contributions in the paper can be summarized as follows: <br />
* A Deep Alternative Neural Network (DANN) is proposed for action recognition. <br />
* DANN consists of alternative volumetric convolutional and recurrent layers. <br />
* An adaptive method to determine the temporal size of the video clip <br />
* A volumetric pyramid pooling layer to resize the output before fully connected layers.<br />
<br />
===Related Work===<br />
There are already exists a very related paper ([11]) in the literature which proposed a similar alternation architecture. In particular, the similarity between the authors work and the aforementioned paper is that they both propose alternating CNN-RNN architectures. This similarity between the two works was noted by Reviewer 1 in the NIPS review process.<br />
<br />
=== Optic Flow ===<br />
Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene.<br />
It can be used for affordance perception, the ability to discern possibilities for action within the environment. The following image describes the optical flow :[[File:oflow.png | thumb | 50px]]<br />
<br />
==Deep Alternative Neural Network:==<br />
===Adaptive Network Input===<br />
The input size of the video clip is generally determined empirically and various approaches have been taken in the past with a different number of frames. For instance, many previous papers suggested to used shorter intervals of between 1 to 16 frames. However, more recent work[9] recognized that human-based actions often “span tens or hundreds of frames” and longer intervals such as 60 frames will outperform the one with a shorter interval. However, there’s still no systematic way of determining the number of frames for input size of the network. This serves the motives for the authors of this paper to develop this adaptive method. Past research shows that motion energy intensity induced by human activity exhibits a regular periodicity. This signal can be approximately estimated by optical flow computation as shown in Figure 1, and is particularly suitable to address our temporal estimation due to: <br />
* the local minima and maxima landmarks probably correspond to characteristic gesture and motion <br />
* it is relatively robust to changes in camera viewpoint.<br />
<br />
The authors have come up with an adaptive method to automatically select the most discriminative video fragments using the density of optical flow energy which exhibits regular periodicity. According to Wikipedia, optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene, and optical flow methods try to calculate the motion between two image frames which are taken at different times. The optimal flow energy of an optical field $(v_{x},v_{y})$ is defined as follows <br />
<br />
:<math>e(I)=\underset{(x,y)\in\mathbb{P}}{\operatorname{\Sigma}} ||v_{x}(x,y),v_{y}(x,y)||_{2}</math><br />
<br />
Here, P is the pixel level set of selected interest points. They locate the local minima and maxima landmarks $\{t\}$ of $\epsilon = \{e(I_1),\dots,e(I_t)\}$ and for each two consecutive landmarks create a video fragment $s$ by extracting the frames $s = \{I_{t-1},\dots,I_t\}$.<br />
<br />
[[File:golfswing.png]]<br />
<br />
To deal with the different length of video clip, we adopt the idea of spatial pyramid pooling (SPP) in [12] and extend to temporal domain, developing a volumetric pyramid pooling (VPP) layer to transfer video clip of arbitrary size into a universal length in the last alternative layer before fully connected layer.<br />
<br />
===Alternative Layer===<br />
This is a key layer consisting of a standard volumetric convolutional layer followed by a designed recurrent layer. Volumetric convolutional extracts features from local neighborhoods and a recurrent layer is applied to the output and it proceeds iteratively for T times. The input of a unit at position (x,y,z) in the jth feature map of the ith AL in time t, $u_{ij}^{xyz}(t)$, is given by,<br />
<br />
:<math>u_{ij}^{xyz}(t) = u_{ij}^{xyz}(0) + f(w_{ij}^{r}u_{ij}^{xyz}(t-1)) + b_{ij} \\ <br />
u_{ij}^{xyz}(0) = f(w_{i-1}^{c}u_{(i-1)j}^{xyz}) <br />
</math><br />
<br />
U(0): feed forward output of volumetric convolutional layer. <br />
U(t-1) : recurrent input of previous time <br />
$w_{k}^{c}$ and $w_{k}^{r}$: vectorized feed-forward kernels and recurrent kernels respectively <br />
f: ReLU function followed by a local response normalization (LRN), which mimics the lateral inhibition in the cortex where different features compete for<br />
large responses. <br />
<br />
Figure 3 depicts this structure:<br />
[[File:unfolded.PNG|1000px]]<br />
<br />
The recurrent connections in AL provide two advantages. First, they enable every unit to incorporate contexts in an arbitrarily large region in the current layer。 However, the drawback is that without top-down connections, the states of the units in the current layer cannot be influenced by the context seen by higher-level units; Second, the recurrent connections increase the network depth while keeping the number of adjustable parameters constant by weight sharing, because AL uses only extra constant parameters of a recurrent kernel size.<br />
<br />
===Volumetric Pyramid Pooling Layer===<br />
<br />
[[File:Volumetric Pyramid Pooling Layer.png|thumb|550px|Figure 2: Volumetric Pyramid Pooling Layer]]<br />
The authors have replaced the last pooling layer with a volumetric pyramid pooling layer (VPPL) as we need fixed-length vectors for the fully connected layers and the AL accepts video clips of arbitrary sizes and produces outputs of variable sizes. Figure 2 illustrates the structure of VPPL. The authors have used the max pooling to pool the responses of each kernel in each volumetric bin. The outputs are kM dimensional vectors where:<br />
<br />
M: number of bins <br />
<br />
K: Number of kernels in the last alternative layer.<br />
<br />
This layer structure allows not only for arbitrary-length videos, but also arbitrary aspect ratios and scales.<br />
<br />
It reminds me of the spatial pyramid pooling in deep convolutional networks. In CNN, the dimensions of the training data are the same, so that after convolution, we can train the classifiers effectively. To improve the limit of the same dimension, spatial pyramid pooling is introduced.<br />
<br />
==Overall Architecture== <br />
[[File:DANN Architecture.png|thumb|550px|Figure 3:DANN Architecture]]<br />
The following are the components of the DANN (as shown in Figure 3)<br />
* 6 Alternative layers with 64, 128, 256, 256, 512 and 512 kernel response maps <br />
* 5 ReLU and volumetric pooling layers <br />
* 1 volumetric pyramid pooling layer <br />
* 3 fully connected layers of size 2048 each <br />
* A softmax layer<br />
<br />
==Implementation details==<br />
The authors have used the Torch toolbox platform for implementations of volumetric convolutions, recurrent layers, and optimizations. They have used a technique called as random clipping for data augmentation, in which they select a point randomly from the input video of fixed size 80x80xt after determining the temporal size t. This technique is preferred to the common alternative of pre-processing data using a sliding window approach to have pre-segmented clips. The authors cite how using this technique limits the amount of data when the windows are not overlapped with one another. For training the network the authors have used SGD applied to mini-batches of size 30 with a negative log likelihood criterion. Training is done by minimizing the cross-entropy loss function using backpropagation through time algorithm (BPTT). During testing, they applied a video clip divided into 80x80xt clips with a stride of 4 frames followed by testing with 10 crops. The final score is the average of all clip-level scores and the crop scores.<br />
Data augmentation techniques such as the multi-scale cropping method have been evaluated due to the recent success in the state-of-the-art performance displayed by Very Deep Two-stream ConvNets. Going by intuition, the corner cropping strategy could provide better results ( based on trade-off degree) since the receptive fields can focus harder on the central regions of the video frames [7].<br />
<br />
==Evaluations==<br />
===Datasets:===<br />
* The datasets used in the evaluation are UCF101 and HMDB51 <br />
* UCF101 – 13K videos annotated into 101 classes <br />
* HMDB51 – 6.8K videos with 51 actions. <br />
* Three training and test splits are provided <br />
* Performance measured by mean classification accuracy across the splits. <br />
* UCF101 split – 9.5K videos; HMDB51 – 3.7K training videos.<br />
<br />
===Quantitative Results===<br />
The authors used three types of optical flows, viz., sparse, RGB and TVL1 and found that TVL1 is suitable as action recognition is more easy to learn from motion information compared to raw pixel values. The influence of data augmentation is also studied. The baseline being sliding window with 75% overlap, the authors observed that the random clipping and multi-scale clipping outperformed the baseline on the UCF101 split 1 dataset. The authors were able to prove that the adaptive temporal length was able to give a boost of 4.2% when compared with architectures that had fixed-size temporal length. Experiments were also conducted to see if the learnings done in one dataset could improve the accuracy of another dataset. Fine tuning HMDB51 from UCF101 boosted the performance from 56.4% to 62.5%. The authors also observed that increasing the AL layers improves the performance as larger contexts are being embedded into the DANN. The DANN achieved an overall accuracy of 65.9% and 91.6% on HMDB51 and UCF101 respectively.<br />
<br />
<br />
[[File:Performance Comparison of different input modalities.png]]<br />
<br />
===Qualitative Analysis===<br />
The authors have discussed the quality of the prediction in the video clips taking examples of two different scenes involving bowling and haircut. In the bowling scene, the adaptive temporal choice used by DANN could aggregate more reasonable semantic structures and hence it leveraged reasonable video clips as input. On the other hand, the performance on the haircut video clip was not up to the mark as the rich contexts provided by the DANN was not helpful in a setting with simple actions performed in a simple background.<br />
<br />
==Conclusions and Critique==<br />
* Deep alternative neural network is introduced for action recognition.<br />
* The key new component is an "alternative layer" which is composed of a convolutional layer followed by a recurrent layer. As the paper targets action recognition in video, the convolutional layer acts on a 3D spatio-temporal volume.<br />
* DANN consists of volumetric convolutional layer and a recurrent layer. <br />
* A preprocessing stage based on optical flow is used to select video fragments to feed to the neural network.<br />
* The authors have experimented with different datasets like HMDB51 and UCF101 with different scenarios and compared the * * performance of DANN with other approaches. <br />
* The spatial size is still chosen in an ad hoc manner and this can be an area of improvement. <br />
* There are prospects for studying action tube which is a more compact input.<br />
* The paper uses volumetric convolutional layer, but it doesn't say how it is better than recurrent neural networks in exploring temporal information.<br />
* There is no experimental evidence to compare the proposed method with long-term recurrent convolutional network. Also, there is no analysis of time complexity of the approach used.<br />
<br />
Github code: https://github.com/wangjinzhuo/DANN<br />
<br />
In the formal review of the paper [https://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html], some interesting criticisms of the paper are surfaced. For starters, one reviewer notes that a similar architecture was proposed in [https://arxiv.org/abs/1511.06432], limiting the novelty of the approach somewhat. The reviewers question the validity of the approach in even slightly more complicated settings (i.e. any non-static camera, which brings in the issue of optical flow). Other criticisms come from a lack of clear motivation for choices that the authors have made, for instance, the use of Local Response Normalization has fallen slightly out-of-favour, or the benefit of using a sliding window approach during testing (and random clips during training).<br />
<br />
Quantitatively, the benefits of the author's approach are not readily apparent. In comparisons with state-of-the-art, the proposed model performs worse on HMDB, and while they claim the highest performance on UCF, the increase is merely .1 over previous best efforts.<br />
<br />
==References==<br />
<br />
[1] Andrej Karpathy, George Toderici, Sachin Shetty, Tommy Leung, Rahul Sukthankar, and Li FeiFei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014 <br />
<br />
[2] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014. <br />
<br />
[3]Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deepconvolutional descriptors. In CVPR, pages 4305–4314, 2015. <br />
<br />
[4] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2013. <br />
<br />
[5] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497, 2015. <br />
<br />
[6]Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. arXiv preprint arXiv:1604.04494, 2016. <br />
<br />
[7]Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao. Towards Good Practices for Very Deep Two-Stream ConvNets. arXiv preprint arXiv:1507.02159 , 2015. <br />
<br />
[8] IEEE International Symposium on Multimedia 2013 <br />
<br />
[9] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action<br />
recognition. arXiv preprint arXiv:1604.04494, 2016<br />
<br />
[10] https://en.wikipedia.org/wiki/Optical_flow<br />
<br />
[11] Delving Deeper into Convolutional Networks for Learning Video Representations Nicolas Ballas, Li Yao, Chris Pal, Aaron Courville, ICLR 2016 <br />
<br />
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI, 37(9):1904–1916, 2015.<br />
<br />
[13] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime tv-l<br />
1 optical flow. In Pattern Recognition, pages 214–223. 2007.<br />
<br />
A list of expert reviews: http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Alternative_Neural_Network:_Exploring_Contexts_As_Early_As_Possible_For_Action_Recognition&diff=31256Deep Alternative Neural Network: Exploring Contexts As Early As Possible For Action Recognition2017-11-23T16:59:14Z<p>Amanjhunjhunwala: /* Optic Flow */</p>
<hr />
<div>==Introduction==<br />
<br />
Action recognition deals with recognizing and classifying the actions or activities done by humans or other agents in a video clip. In action recognition, contexts contribute semantic clues for action recognition in video(See Fig below[8]). Conventional Neural Networks [1,2,3] and their shifted version 3D CNNs [4,5,6] have been employed in action recognition but they identify and aggregate the contexts at later stages. <br />
[[File:ActionRecognition1.jpg|center|400px|border|context and action region]]<br />
<br />
The authors have come up with a strategy to identify contexts in the videos as early as possible and leverage their evolutions for action recognition. Contexts contribute semantic clues for action recognition in videos. The networks themselves involve a lot of layers, with the first layer typically having a receptive field (RF) that outputs only extra local features. As we go deeper into the layers the Receptive Fields expand and we start getting the contexts. The authors identified that increasing the number of layers will only cause additional burden in terms of handling the parameters and contexts could be obtained even in the earlier stages. The authors also cite the papers [9,10] that relate the CNNs and the visual systems of our brain, one remarkable difference being the abundant recurrent connections in our brain compared to the forward connections in the CNNs. In summary, this paper proposes a novel neural network, called deep alternative neural network (DANN), which is a based method for action recognition. The novel component is called an "alternative layer" which is composed of a volumetric convolutional layer followed by a recurrent layer. In addition, the authors also propose a new approach to select network input based on optical flow. The validity of DANN is carried out on HMDB51 and UCF101 datasets and it is observed that the proposed method achieves comparable performance against state of the art methods.<br />
<br />
The main contributions in the paper can be summarized as follows: <br />
* A Deep Alternative Neural Network (DANN) is proposed for action recognition. <br />
* DANN consists of alternative volumetric convolutional and recurrent layers. <br />
* An adaptive method to determine the temporal size of the video clip <br />
* A volumetric pyramid pooling layer to resize the output before fully connected layers.<br />
<br />
===Related Work===<br />
There are already exists a very related paper ([11]) in the literature which proposed a similar alternation architecture. In particular, the similarity between the authors work and the aforementioned paper is that they both propose alternating CNN-RNN architectures. This similarity between the two works was noted by Reviewer 1 in the NIPS review process.<br />
<br />
=== Optic Flow ===<br />
Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene.<br />
It can be used for affordance perception, the ability to discern possibilities for action within the environment. The following image describes the optical flow :[[File:oflow.png | frame | 50px]]<br />
<br />
==Deep Alternative Neural Network:==<br />
===Adaptive Network Input===<br />
The input size of the video clip is generally determined empirically and various approaches have been taken in the past with a different number of frames. For instance, many previous papers suggested to used shorter intervals of between 1 to 16 frames. However, more recent work[9] recognized that human-based actions often “span tens or hundreds of frames” and longer intervals such as 60 frames will outperform the one with a shorter interval. However, there’s still no systematic way of determining the number of frames for input size of the network. This serves the motives for the authors of this paper to develop this adaptive method. Past research shows that motion energy intensity induced by human activity exhibits a regular periodicity. This signal can be approximately estimated by optical flow computation as shown in Figure 1, and is particularly suitable to address our temporal estimation due to: <br />
* the local minima and maxima landmarks probably correspond to characteristic gesture and motion <br />
* it is relatively robust to changes in camera viewpoint.<br />
<br />
The authors have come up with an adaptive method to automatically select the most discriminative video fragments using the density of optical flow energy which exhibits regular periodicity. According to Wikipedia, optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene, and optical flow methods try to calculate the motion between two image frames which are taken at different times. The optimal flow energy of an optical field $(v_{x},v_{y})$ is defined as follows <br />
<br />
:<math>e(I)=\underset{(x,y)\in\mathbb{P}}{\operatorname{\Sigma}} ||v_{x}(x,y),v_{y}(x,y)||_{2}</math><br />
<br />
Here, P is the pixel level set of selected interest points. They locate the local minima and maxima landmarks $\{t\}$ of $\epsilon = \{e(I_1),\dots,e(I_t)\}$ and for each two consecutive landmarks create a video fragment $s$ by extracting the frames $s = \{I_{t-1},\dots,I_t\}$.<br />
<br />
[[File:golfswing.png]]<br />
<br />
To deal with the different length of video clip, we adopt the idea of spatial pyramid pooling (SPP) in [12] and extend to temporal domain, developing a volumetric pyramid pooling (VPP) layer to transfer video clip of arbitrary size into a universal length in the last alternative layer before fully connected layer.<br />
<br />
===Alternative Layer===<br />
This is a key layer consisting of a standard volumetric convolutional layer followed by a designed recurrent layer. Volumetric convolutional extracts features from local neighborhoods and a recurrent layer is applied to the output and it proceeds iteratively for T times. The input of a unit at position (x,y,z) in the jth feature map of the ith AL in time t, $u_{ij}^{xyz}(t)$, is given by,<br />
<br />
:<math>u_{ij}^{xyz}(t) = u_{ij}^{xyz}(0) + f(w_{ij}^{r}u_{ij}^{xyz}(t-1)) + b_{ij} \\ <br />
u_{ij}^{xyz}(0) = f(w_{i-1}^{c}u_{(i-1)j}^{xyz}) <br />
</math><br />
<br />
U(0): feed forward output of volumetric convolutional layer. <br />
U(t-1) : recurrent input of previous time <br />
$w_{k}^{c}$ and $w_{k}^{r}$: vectorized feed-forward kernels and recurrent kernels respectively <br />
f: ReLU function followed by a local response normalization (LRN), which mimics the lateral inhibition in the cortex where different features compete for<br />
large responses. <br />
<br />
Figure 3 depicts this structure:<br />
[[File:unfolded.PNG|1000px]]<br />
<br />
The recurrent connections in AL provide two advantages. First, they enable every unit to incorporate contexts in an arbitrarily large region in the current layer。 However, the drawback is that without top-down connections, the states of the units in the current layer cannot be influenced by the context seen by higher-level units; Second, the recurrent connections increase the network depth while keeping the number of adjustable parameters constant by weight sharing, because AL uses only extra constant parameters of a recurrent kernel size.<br />
<br />
===Volumetric Pyramid Pooling Layer===<br />
<br />
[[File:Volumetric Pyramid Pooling Layer.png|thumb|550px|Figure 2: Volumetric Pyramid Pooling Layer]]<br />
The authors have replaced the last pooling layer with a volumetric pyramid pooling layer (VPPL) as we need fixed-length vectors for the fully connected layers and the AL accepts video clips of arbitrary sizes and produces outputs of variable sizes. Figure 2 illustrates the structure of VPPL. The authors have used the max pooling to pool the responses of each kernel in each volumetric bin. The outputs are kM dimensional vectors where:<br />
<br />
M: number of bins <br />
<br />
K: Number of kernels in the last alternative layer.<br />
<br />
This layer structure allows not only for arbitrary-length videos, but also arbitrary aspect ratios and scales.<br />
<br />
It reminds me of the spatial pyramid pooling in deep convolutional networks. In CNN, the dimensions of the training data are the same, so that after convolution, we can train the classifiers effectively. To improve the limit of the same dimension, spatial pyramid pooling is introduced.<br />
<br />
==Overall Architecture== <br />
[[File:DANN Architecture.png|thumb|550px|Figure 3:DANN Architecture]]<br />
The following are the components of the DANN (as shown in Figure 3)<br />
* 6 Alternative layers with 64, 128, 256, 256, 512 and 512 kernel response maps <br />
* 5 ReLU and volumetric pooling layers <br />
* 1 volumetric pyramid pooling layer <br />
* 3 fully connected layers of size 2048 each <br />
* A softmax layer<br />
<br />
==Implementation details==<br />
The authors have used the Torch toolbox platform for implementations of volumetric convolutions, recurrent layers, and optimizations. They have used a technique called as random clipping for data augmentation, in which they select a point randomly from the input video of fixed size 80x80xt after determining the temporal size t. This technique is preferred to the common alternative of pre-processing data using a sliding window approach to have pre-segmented clips. The authors cite how using this technique limits the amount of data when the windows are not overlapped with one another. For training the network the authors have used SGD applied to mini-batches of size 30 with a negative log likelihood criterion. Training is done by minimizing the cross-entropy loss function using backpropagation through time algorithm (BPTT). During testing, they applied a video clip divided into 80x80xt clips with a stride of 4 frames followed by testing with 10 crops. The final score is the average of all clip-level scores and the crop scores.<br />
Data augmentation techniques such as the multi-scale cropping method have been evaluated due to the recent success in the state-of-the-art performance displayed by Very Deep Two-stream ConvNets. Going by intuition, the corner cropping strategy could provide better results ( based on trade-off degree) since the receptive fields can focus harder on the central regions of the video frames [7].<br />
<br />
==Evaluations==<br />
===Datasets:===<br />
* The datasets used in the evaluation are UCF101 and HMDB51 <br />
* UCF101 – 13K videos annotated into 101 classes <br />
* HMDB51 – 6.8K videos with 51 actions. <br />
* Three training and test splits are provided <br />
* Performance measured by mean classification accuracy across the splits. <br />
* UCF101 split – 9.5K videos; HMDB51 – 3.7K training videos.<br />
<br />
===Quantitative Results===<br />
The authors used three types of optical flows, viz., sparse, RGB and TVL1 and found that TVL1 is suitable as action recognition is more easy to learn from motion information compared to raw pixel values. The influence of data augmentation is also studied. The baseline being sliding window with 75% overlap, the authors observed that the random clipping and multi-scale clipping outperformed the baseline on the UCF101 split 1 dataset. The authors were able to prove that the adaptive temporal length was able to give a boost of 4.2% when compared with architectures that had fixed-size temporal length. Experiments were also conducted to see if the learnings done in one dataset could improve the accuracy of another dataset. Fine tuning HMDB51 from UCF101 boosted the performance from 56.4% to 62.5%. The authors also observed that increasing the AL layers improves the performance as larger contexts are being embedded into the DANN. The DANN achieved an overall accuracy of 65.9% and 91.6% on HMDB51 and UCF101 respectively.<br />
<br />
<br />
[[File:Performance Comparison of different input modalities.png]]<br />
<br />
===Qualitative Analysis===<br />
The authors have discussed the quality of the prediction in the video clips taking examples of two different scenes involving bowling and haircut. In the bowling scene, the adaptive temporal choice used by DANN could aggregate more reasonable semantic structures and hence it leveraged reasonable video clips as input. On the other hand, the performance on the haircut video clip was not up to the mark as the rich contexts provided by the DANN was not helpful in a setting with simple actions performed in a simple background.<br />
<br />
==Conclusions and Critique==<br />
* Deep alternative neural network is introduced for action recognition.<br />
* The key new component is an "alternative layer" which is composed of a convolutional layer followed by a recurrent layer. As the paper targets action recognition in video, the convolutional layer acts on a 3D spatio-temporal volume.<br />
* DANN consists of volumetric convolutional layer and a recurrent layer. <br />
* A preprocessing stage based on optical flow is used to select video fragments to feed to the neural network.<br />
* The authors have experimented with different datasets like HMDB51 and UCF101 with different scenarios and compared the * * performance of DANN with other approaches. <br />
* The spatial size is still chosen in an ad hoc manner and this can be an area of improvement. <br />
* There are prospects for studying action tube which is a more compact input.<br />
* The paper uses volumetric convolutional layer, but it doesn't say how it is better than recurrent neural networks in exploring temporal information.<br />
* There is no experimental evidence to compare the proposed method with long-term recurrent convolutional network. Also, there is no analysis of time complexity of the approach used.<br />
<br />
Github code: https://github.com/wangjinzhuo/DANN<br />
<br />
In the formal review of the paper [https://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html], some interesting criticisms of the paper are surfaced. For starters, one reviewer notes that a similar architecture was proposed in [https://arxiv.org/abs/1511.06432], limiting the novelty of the approach somewhat. The reviewers question the validity of the approach in even slightly more complicated settings (i.e. any non-static camera, which brings in the issue of optical flow). Other criticisms come from a lack of clear motivation for choices that the authors have made, for instance, the use of Local Response Normalization has fallen slightly out-of-favour, or the benefit of using a sliding window approach during testing (and random clips during training).<br />
<br />
Quantitatively, the benefits of the author's approach are not readily apparent. In comparisons with state-of-the-art, the proposed model performs worse on HMDB, and while they claim the highest performance on UCF, the increase is merely .1 over previous best efforts.<br />
<br />
==References==<br />
<br />
[1] Andrej Karpathy, George Toderici, Sachin Shetty, Tommy Leung, Rahul Sukthankar, and Li FeiFei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014 <br />
<br />
[2] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014. <br />
<br />
[3]Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deepconvolutional descriptors. In CVPR, pages 4305–4314, 2015. <br />
<br />
[4] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2013. <br />
<br />
[5] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497, 2015. <br />
<br />
[6]Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. arXiv preprint arXiv:1604.04494, 2016. <br />
<br />
[7]Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao. Towards Good Practices for Very Deep Two-Stream ConvNets. arXiv preprint arXiv:1507.02159 , 2015. <br />
<br />
[8] IEEE International Symposium on Multimedia 2013 <br />
<br />
[9] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action<br />
recognition. arXiv preprint arXiv:1604.04494, 2016<br />
<br />
[10] https://en.wikipedia.org/wiki/Optical_flow<br />
<br />
[11] Delving Deeper into Convolutional Networks for Learning Video Representations Nicolas Ballas, Li Yao, Chris Pal, Aaron Courville, ICLR 2016 <br />
<br />
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI, 37(9):1904–1916, 2015.<br />
<br />
[13] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime tv-l<br />
1 optical flow. In Pattern Recognition, pages 214–223. 2007.<br />
<br />
A list of expert reviews: http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Alternative_Neural_Network:_Exploring_Contexts_As_Early_As_Possible_For_Action_Recognition&diff=31255Deep Alternative Neural Network: Exploring Contexts As Early As Possible For Action Recognition2017-11-23T16:58:54Z<p>Amanjhunjhunwala: /* Optic Flow */</p>
<hr />
<div>==Introduction==<br />
<br />
Action recognition deals with recognizing and classifying the actions or activities done by humans or other agents in a video clip. In action recognition, contexts contribute semantic clues for action recognition in video(See Fig below[8]). Conventional Neural Networks [1,2,3] and their shifted version 3D CNNs [4,5,6] have been employed in action recognition but they identify and aggregate the contexts at later stages. <br />
[[File:ActionRecognition1.jpg|center|400px|border|context and action region]]<br />
<br />
The authors have come up with a strategy to identify contexts in the videos as early as possible and leverage their evolutions for action recognition. Contexts contribute semantic clues for action recognition in videos. The networks themselves involve a lot of layers, with the first layer typically having a receptive field (RF) that outputs only extra local features. As we go deeper into the layers the Receptive Fields expand and we start getting the contexts. The authors identified that increasing the number of layers will only cause additional burden in terms of handling the parameters and contexts could be obtained even in the earlier stages. The authors also cite the papers [9,10] that relate the CNNs and the visual systems of our brain, one remarkable difference being the abundant recurrent connections in our brain compared to the forward connections in the CNNs. In summary, this paper proposes a novel neural network, called deep alternative neural network (DANN), which is a based method for action recognition. The novel component is called an "alternative layer" which is composed of a volumetric convolutional layer followed by a recurrent layer. In addition, the authors also propose a new approach to select network input based on optical flow. The validity of DANN is carried out on HMDB51 and UCF101 datasets and it is observed that the proposed method achieves comparable performance against state of the art methods.<br />
<br />
The main contributions in the paper can be summarized as follows: <br />
* A Deep Alternative Neural Network (DANN) is proposed for action recognition. <br />
* DANN consists of alternative volumetric convolutional and recurrent layers. <br />
* An adaptive method to determine the temporal size of the video clip <br />
* A volumetric pyramid pooling layer to resize the output before fully connected layers.<br />
<br />
===Related Work===<br />
There are already exists a very related paper ([11]) in the literature which proposed a similar alternation architecture. In particular, the similarity between the authors work and the aforementioned paper is that they both propose alternating CNN-RNN architectures. This similarity between the two works was noted by Reviewer 1 in the NIPS review process.<br />
<br />
=== Optic Flow ===<br />
Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene.<br />
It can be used for affordance perception, the ability to discern possibilities for action within the environment. The following image describes the optical flow :[[File:oflow.png | frame]]<br />
<br />
==Deep Alternative Neural Network:==<br />
===Adaptive Network Input===<br />
The input size of the video clip is generally determined empirically and various approaches have been taken in the past with a different number of frames. For instance, many previous papers suggested to used shorter intervals of between 1 to 16 frames. However, more recent work[9] recognized that human-based actions often “span tens or hundreds of frames” and longer intervals such as 60 frames will outperform the one with a shorter interval. However, there’s still no systematic way of determining the number of frames for input size of the network. This serves the motives for the authors of this paper to develop this adaptive method. Past research shows that motion energy intensity induced by human activity exhibits a regular periodicity. This signal can be approximately estimated by optical flow computation as shown in Figure 1, and is particularly suitable to address our temporal estimation due to: <br />
* the local minima and maxima landmarks probably correspond to characteristic gesture and motion <br />
* it is relatively robust to changes in camera viewpoint.<br />
<br />
The authors have come up with an adaptive method to automatically select the most discriminative video fragments using the density of optical flow energy which exhibits regular periodicity. According to Wikipedia, optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene, and optical flow methods try to calculate the motion between two image frames which are taken at different times. The optimal flow energy of an optical field $(v_{x},v_{y})$ is defined as follows <br />
<br />
:<math>e(I)=\underset{(x,y)\in\mathbb{P}}{\operatorname{\Sigma}} ||v_{x}(x,y),v_{y}(x,y)||_{2}</math><br />
<br />
Here, P is the pixel level set of selected interest points. They locate the local minima and maxima landmarks $\{t\}$ of $\epsilon = \{e(I_1),\dots,e(I_t)\}$ and for each two consecutive landmarks create a video fragment $s$ by extracting the frames $s = \{I_{t-1},\dots,I_t\}$.<br />
<br />
[[File:golfswing.png]]<br />
<br />
To deal with the different length of video clip, we adopt the idea of spatial pyramid pooling (SPP) in [12] and extend to temporal domain, developing a volumetric pyramid pooling (VPP) layer to transfer video clip of arbitrary size into a universal length in the last alternative layer before fully connected layer.<br />
<br />
===Alternative Layer===<br />
This is a key layer consisting of a standard volumetric convolutional layer followed by a designed recurrent layer. Volumetric convolutional extracts features from local neighborhoods and a recurrent layer is applied to the output and it proceeds iteratively for T times. The input of a unit at position (x,y,z) in the jth feature map of the ith AL in time t, $u_{ij}^{xyz}(t)$, is given by,<br />
<br />
:<math>u_{ij}^{xyz}(t) = u_{ij}^{xyz}(0) + f(w_{ij}^{r}u_{ij}^{xyz}(t-1)) + b_{ij} \\ <br />
u_{ij}^{xyz}(0) = f(w_{i-1}^{c}u_{(i-1)j}^{xyz}) <br />
</math><br />
<br />
U(0): feed forward output of volumetric convolutional layer. <br />
U(t-1) : recurrent input of previous time <br />
$w_{k}^{c}$ and $w_{k}^{r}$: vectorized feed-forward kernels and recurrent kernels respectively <br />
f: ReLU function followed by a local response normalization (LRN), which mimics the lateral inhibition in the cortex where different features compete for<br />
large responses. <br />
<br />
Figure 3 depicts this structure:<br />
[[File:unfolded.PNG|1000px]]<br />
<br />
The recurrent connections in AL provide two advantages. First, they enable every unit to incorporate contexts in an arbitrarily large region in the current layer。 However, the drawback is that without top-down connections, the states of the units in the current layer cannot be influenced by the context seen by higher-level units; Second, the recurrent connections increase the network depth while keeping the number of adjustable parameters constant by weight sharing, because AL uses only extra constant parameters of a recurrent kernel size.<br />
<br />
===Volumetric Pyramid Pooling Layer===<br />
<br />
[[File:Volumetric Pyramid Pooling Layer.png|thumb|550px|Figure 2: Volumetric Pyramid Pooling Layer]]<br />
The authors have replaced the last pooling layer with a volumetric pyramid pooling layer (VPPL) as we need fixed-length vectors for the fully connected layers and the AL accepts video clips of arbitrary sizes and produces outputs of variable sizes. Figure 2 illustrates the structure of VPPL. The authors have used the max pooling to pool the responses of each kernel in each volumetric bin. The outputs are kM dimensional vectors where:<br />
<br />
M: number of bins <br />
<br />
K: Number of kernels in the last alternative layer.<br />
<br />
This layer structure allows not only for arbitrary-length videos, but also arbitrary aspect ratios and scales.<br />
<br />
It reminds me of the spatial pyramid pooling in deep convolutional networks. In CNN, the dimensions of the training data are the same, so that after convolution, we can train the classifiers effectively. To improve the limit of the same dimension, spatial pyramid pooling is introduced.<br />
<br />
==Overall Architecture== <br />
[[File:DANN Architecture.png|thumb|550px|Figure 3:DANN Architecture]]<br />
The following are the components of the DANN (as shown in Figure 3)<br />
* 6 Alternative layers with 64, 128, 256, 256, 512 and 512 kernel response maps <br />
* 5 ReLU and volumetric pooling layers <br />
* 1 volumetric pyramid pooling layer <br />
* 3 fully connected layers of size 2048 each <br />
* A softmax layer<br />
<br />
==Implementation details==<br />
The authors have used the Torch toolbox platform for implementations of volumetric convolutions, recurrent layers, and optimizations. They have used a technique called as random clipping for data augmentation, in which they select a point randomly from the input video of fixed size 80x80xt after determining the temporal size t. This technique is preferred to the common alternative of pre-processing data using a sliding window approach to have pre-segmented clips. The authors cite how using this technique limits the amount of data when the windows are not overlapped with one another. For training the network the authors have used SGD applied to mini-batches of size 30 with a negative log likelihood criterion. Training is done by minimizing the cross-entropy loss function using backpropagation through time algorithm (BPTT). During testing, they applied a video clip divided into 80x80xt clips with a stride of 4 frames followed by testing with 10 crops. The final score is the average of all clip-level scores and the crop scores.<br />
Data augmentation techniques such as the multi-scale cropping method have been evaluated due to the recent success in the state-of-the-art performance displayed by Very Deep Two-stream ConvNets. Going by intuition, the corner cropping strategy could provide better results ( based on trade-off degree) since the receptive fields can focus harder on the central regions of the video frames [7].<br />
<br />
==Evaluations==<br />
===Datasets:===<br />
* The datasets used in the evaluation are UCF101 and HMDB51 <br />
* UCF101 – 13K videos annotated into 101 classes <br />
* HMDB51 – 6.8K videos with 51 actions. <br />
* Three training and test splits are provided <br />
* Performance measured by mean classification accuracy across the splits. <br />
* UCF101 split – 9.5K videos; HMDB51 – 3.7K training videos.<br />
<br />
===Quantitative Results===<br />
The authors used three types of optical flows, viz., sparse, RGB and TVL1 and found that TVL1 is suitable as action recognition is more easy to learn from motion information compared to raw pixel values. The influence of data augmentation is also studied. The baseline being sliding window with 75% overlap, the authors observed that the random clipping and multi-scale clipping outperformed the baseline on the UCF101 split 1 dataset. The authors were able to prove that the adaptive temporal length was able to give a boost of 4.2% when compared with architectures that had fixed-size temporal length. Experiments were also conducted to see if the learnings done in one dataset could improve the accuracy of another dataset. Fine tuning HMDB51 from UCF101 boosted the performance from 56.4% to 62.5%. The authors also observed that increasing the AL layers improves the performance as larger contexts are being embedded into the DANN. The DANN achieved an overall accuracy of 65.9% and 91.6% on HMDB51 and UCF101 respectively.<br />
<br />
<br />
[[File:Performance Comparison of different input modalities.png]]<br />
<br />
===Qualitative Analysis===<br />
The authors have discussed the quality of the prediction in the video clips taking examples of two different scenes involving bowling and haircut. In the bowling scene, the adaptive temporal choice used by DANN could aggregate more reasonable semantic structures and hence it leveraged reasonable video clips as input. On the other hand, the performance on the haircut video clip was not up to the mark as the rich contexts provided by the DANN was not helpful in a setting with simple actions performed in a simple background.<br />
<br />
==Conclusions and Critique==<br />
* Deep alternative neural network is introduced for action recognition.<br />
* The key new component is an "alternative layer" which is composed of a convolutional layer followed by a recurrent layer. As the paper targets action recognition in video, the convolutional layer acts on a 3D spatio-temporal volume.<br />
* DANN consists of volumetric convolutional layer and a recurrent layer. <br />
* A preprocessing stage based on optical flow is used to select video fragments to feed to the neural network.<br />
* The authors have experimented with different datasets like HMDB51 and UCF101 with different scenarios and compared the * * performance of DANN with other approaches. <br />
* The spatial size is still chosen in an ad hoc manner and this can be an area of improvement. <br />
* There are prospects for studying action tube which is a more compact input.<br />
* The paper uses volumetric convolutional layer, but it doesn't say how it is better than recurrent neural networks in exploring temporal information.<br />
* There is no experimental evidence to compare the proposed method with long-term recurrent convolutional network. Also, there is no analysis of time complexity of the approach used.<br />
<br />
Github code: https://github.com/wangjinzhuo/DANN<br />
<br />
In the formal review of the paper [https://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html], some interesting criticisms of the paper are surfaced. For starters, one reviewer notes that a similar architecture was proposed in [https://arxiv.org/abs/1511.06432], limiting the novelty of the approach somewhat. The reviewers question the validity of the approach in even slightly more complicated settings (i.e. any non-static camera, which brings in the issue of optical flow). Other criticisms come from a lack of clear motivation for choices that the authors have made, for instance, the use of Local Response Normalization has fallen slightly out-of-favour, or the benefit of using a sliding window approach during testing (and random clips during training).<br />
<br />
Quantitatively, the benefits of the author's approach are not readily apparent. In comparisons with state-of-the-art, the proposed model performs worse on HMDB, and while they claim the highest performance on UCF, the increase is merely .1 over previous best efforts.<br />
<br />
==References==<br />
<br />
[1] Andrej Karpathy, George Toderici, Sachin Shetty, Tommy Leung, Rahul Sukthankar, and Li FeiFei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014 <br />
<br />
[2] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014. <br />
<br />
[3]Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deepconvolutional descriptors. In CVPR, pages 4305–4314, 2015. <br />
<br />
[4] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2013. <br />
<br />
[5] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497, 2015. <br />
<br />
[6]Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. arXiv preprint arXiv:1604.04494, 2016. <br />
<br />
[7]Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao. Towards Good Practices for Very Deep Two-Stream ConvNets. arXiv preprint arXiv:1507.02159 , 2015. <br />
<br />
[8] IEEE International Symposium on Multimedia 2013 <br />
<br />
[9] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action<br />
recognition. arXiv preprint arXiv:1604.04494, 2016<br />
<br />
[10] https://en.wikipedia.org/wiki/Optical_flow<br />
<br />
[11] Delving Deeper into Convolutional Networks for Learning Video Representations Nicolas Ballas, Li Yao, Chris Pal, Aaron Courville, ICLR 2016 <br />
<br />
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI, 37(9):1904–1916, 2015.<br />
<br />
[13] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime tv-l<br />
1 optical flow. In Pattern Recognition, pages 214–223. 2007.<br />
<br />
A list of expert reviews: http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:oflow.png&diff=31254File:oflow.png2017-11-23T16:58:15Z<p>Amanjhunjhunwala: </p>
<hr />
<div></div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Alternative_Neural_Network:_Exploring_Contexts_As_Early_As_Possible_For_Action_Recognition&diff=31253Deep Alternative Neural Network: Exploring Contexts As Early As Possible For Action Recognition2017-11-23T16:57:52Z<p>Amanjhunjhunwala: /* Optic Flow */</p>
<hr />
<div>==Introduction==<br />
<br />
Action recognition deals with recognizing and classifying the actions or activities done by humans or other agents in a video clip. In action recognition, contexts contribute semantic clues for action recognition in video(See Fig below[8]). Conventional Neural Networks [1,2,3] and their shifted version 3D CNNs [4,5,6] have been employed in action recognition but they identify and aggregate the contexts at later stages. <br />
[[File:ActionRecognition1.jpg|center|400px|border|context and action region]]<br />
<br />
The authors have come up with a strategy to identify contexts in the videos as early as possible and leverage their evolutions for action recognition. Contexts contribute semantic clues for action recognition in videos. The networks themselves involve a lot of layers, with the first layer typically having a receptive field (RF) that outputs only extra local features. As we go deeper into the layers the Receptive Fields expand and we start getting the contexts. The authors identified that increasing the number of layers will only cause additional burden in terms of handling the parameters and contexts could be obtained even in the earlier stages. The authors also cite the papers [9,10] that relate the CNNs and the visual systems of our brain, one remarkable difference being the abundant recurrent connections in our brain compared to the forward connections in the CNNs. In summary, this paper proposes a novel neural network, called deep alternative neural network (DANN), which is a based method for action recognition. The novel component is called an "alternative layer" which is composed of a volumetric convolutional layer followed by a recurrent layer. In addition, the authors also propose a new approach to select network input based on optical flow. The validity of DANN is carried out on HMDB51 and UCF101 datasets and it is observed that the proposed method achieves comparable performance against state of the art methods.<br />
<br />
The main contributions in the paper can be summarized as follows: <br />
* A Deep Alternative Neural Network (DANN) is proposed for action recognition. <br />
* DANN consists of alternative volumetric convolutional and recurrent layers. <br />
* An adaptive method to determine the temporal size of the video clip <br />
* A volumetric pyramid pooling layer to resize the output before fully connected layers.<br />
<br />
===Related Work===<br />
There are already exists a very related paper ([11]) in the literature which proposed a similar alternation architecture. In particular, the similarity between the authors work and the aforementioned paper is that they both propose alternating CNN-RNN architectures. This similarity between the two works was noted by Reviewer 1 in the NIPS review process.<br />
<br />
=== Optic Flow ===<br />
Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene.<br />
It can be used for affordance perception, the ability to discern possibilities for action within the environment. The following image describes the optical flow :[[File:oflow.png]]<br />
<br />
==Deep Alternative Neural Network:==<br />
===Adaptive Network Input===<br />
The input size of the video clip is generally determined empirically and various approaches have been taken in the past with a different number of frames. For instance, many previous papers suggested to used shorter intervals of between 1 to 16 frames. However, more recent work[9] recognized that human-based actions often “span tens or hundreds of frames” and longer intervals such as 60 frames will outperform the one with a shorter interval. However, there’s still no systematic way of determining the number of frames for input size of the network. This serves the motives for the authors of this paper to develop this adaptive method. Past research shows that motion energy intensity induced by human activity exhibits a regular periodicity. This signal can be approximately estimated by optical flow computation as shown in Figure 1, and is particularly suitable to address our temporal estimation due to: <br />
* the local minima and maxima landmarks probably correspond to characteristic gesture and motion <br />
* it is relatively robust to changes in camera viewpoint.<br />
<br />
The authors have come up with an adaptive method to automatically select the most discriminative video fragments using the density of optical flow energy which exhibits regular periodicity. According to Wikipedia, optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene, and optical flow methods try to calculate the motion between two image frames which are taken at different times. The optimal flow energy of an optical field $(v_{x},v_{y})$ is defined as follows <br />
<br />
:<math>e(I)=\underset{(x,y)\in\mathbb{P}}{\operatorname{\Sigma}} ||v_{x}(x,y),v_{y}(x,y)||_{2}</math><br />
<br />
Here, P is the pixel level set of selected interest points. They locate the local minima and maxima landmarks $\{t\}$ of $\epsilon = \{e(I_1),\dots,e(I_t)\}$ and for each two consecutive landmarks create a video fragment $s$ by extracting the frames $s = \{I_{t-1},\dots,I_t\}$.<br />
<br />
[[File:golfswing.png]]<br />
<br />
To deal with the different length of video clip, we adopt the idea of spatial pyramid pooling (SPP) in [12] and extend to temporal domain, developing a volumetric pyramid pooling (VPP) layer to transfer video clip of arbitrary size into a universal length in the last alternative layer before fully connected layer.<br />
<br />
===Alternative Layer===<br />
This is a key layer consisting of a standard volumetric convolutional layer followed by a designed recurrent layer. Volumetric convolutional extracts features from local neighborhoods and a recurrent layer is applied to the output and it proceeds iteratively for T times. The input of a unit at position (x,y,z) in the jth feature map of the ith AL in time t, $u_{ij}^{xyz}(t)$, is given by,<br />
<br />
:<math>u_{ij}^{xyz}(t) = u_{ij}^{xyz}(0) + f(w_{ij}^{r}u_{ij}^{xyz}(t-1)) + b_{ij} \\ <br />
u_{ij}^{xyz}(0) = f(w_{i-1}^{c}u_{(i-1)j}^{xyz}) <br />
</math><br />
<br />
U(0): feed forward output of volumetric convolutional layer. <br />
U(t-1) : recurrent input of previous time <br />
$w_{k}^{c}$ and $w_{k}^{r}$: vectorized feed-forward kernels and recurrent kernels respectively <br />
f: ReLU function followed by a local response normalization (LRN), which mimics the lateral inhibition in the cortex where different features compete for<br />
large responses. <br />
<br />
Figure 3 depicts this structure:<br />
[[File:unfolded.PNG|1000px]]<br />
<br />
The recurrent connections in AL provide two advantages. First, they enable every unit to incorporate contexts in an arbitrarily large region in the current layer。 However, the drawback is that without top-down connections, the states of the units in the current layer cannot be influenced by the context seen by higher-level units; Second, the recurrent connections increase the network depth while keeping the number of adjustable parameters constant by weight sharing, because AL uses only extra constant parameters of a recurrent kernel size.<br />
<br />
===Volumetric Pyramid Pooling Layer===<br />
<br />
[[File:Volumetric Pyramid Pooling Layer.png|thumb|550px|Figure 2: Volumetric Pyramid Pooling Layer]]<br />
The authors have replaced the last pooling layer with a volumetric pyramid pooling layer (VPPL) as we need fixed-length vectors for the fully connected layers and the AL accepts video clips of arbitrary sizes and produces outputs of variable sizes. Figure 2 illustrates the structure of VPPL. The authors have used the max pooling to pool the responses of each kernel in each volumetric bin. The outputs are kM dimensional vectors where:<br />
<br />
M: number of bins <br />
<br />
K: Number of kernels in the last alternative layer.<br />
<br />
This layer structure allows not only for arbitrary-length videos, but also arbitrary aspect ratios and scales.<br />
<br />
It reminds me of the spatial pyramid pooling in deep convolutional networks. In CNN, the dimensions of the training data are the same, so that after convolution, we can train the classifiers effectively. To improve the limit of the same dimension, spatial pyramid pooling is introduced.<br />
<br />
==Overall Architecture== <br />
[[File:DANN Architecture.png|thumb|550px|Figure 3:DANN Architecture]]<br />
The following are the components of the DANN (as shown in Figure 3)<br />
* 6 Alternative layers with 64, 128, 256, 256, 512 and 512 kernel response maps <br />
* 5 ReLU and volumetric pooling layers <br />
* 1 volumetric pyramid pooling layer <br />
* 3 fully connected layers of size 2048 each <br />
* A softmax layer<br />
<br />
==Implementation details==<br />
The authors have used the Torch toolbox platform for implementations of volumetric convolutions, recurrent layers, and optimizations. They have used a technique called as random clipping for data augmentation, in which they select a point randomly from the input video of fixed size 80x80xt after determining the temporal size t. This technique is preferred to the common alternative of pre-processing data using a sliding window approach to have pre-segmented clips. The authors cite how using this technique limits the amount of data when the windows are not overlapped with one another. For training the network the authors have used SGD applied to mini-batches of size 30 with a negative log likelihood criterion. Training is done by minimizing the cross-entropy loss function using backpropagation through time algorithm (BPTT). During testing, they applied a video clip divided into 80x80xt clips with a stride of 4 frames followed by testing with 10 crops. The final score is the average of all clip-level scores and the crop scores.<br />
Data augmentation techniques such as the multi-scale cropping method have been evaluated due to the recent success in the state-of-the-art performance displayed by Very Deep Two-stream ConvNets. Going by intuition, the corner cropping strategy could provide better results ( based on trade-off degree) since the receptive fields can focus harder on the central regions of the video frames [7].<br />
<br />
==Evaluations==<br />
===Datasets:===<br />
* The datasets used in the evaluation are UCF101 and HMDB51 <br />
* UCF101 – 13K videos annotated into 101 classes <br />
* HMDB51 – 6.8K videos with 51 actions. <br />
* Three training and test splits are provided <br />
* Performance measured by mean classification accuracy across the splits. <br />
* UCF101 split – 9.5K videos; HMDB51 – 3.7K training videos.<br />
<br />
===Quantitative Results===<br />
The authors used three types of optical flows, viz., sparse, RGB and TVL1 and found that TVL1 is suitable as action recognition is more easy to learn from motion information compared to raw pixel values. The influence of data augmentation is also studied. The baseline being sliding window with 75% overlap, the authors observed that the random clipping and multi-scale clipping outperformed the baseline on the UCF101 split 1 dataset. The authors were able to prove that the adaptive temporal length was able to give a boost of 4.2% when compared with architectures that had fixed-size temporal length. Experiments were also conducted to see if the learnings done in one dataset could improve the accuracy of another dataset. Fine tuning HMDB51 from UCF101 boosted the performance from 56.4% to 62.5%. The authors also observed that increasing the AL layers improves the performance as larger contexts are being embedded into the DANN. The DANN achieved an overall accuracy of 65.9% and 91.6% on HMDB51 and UCF101 respectively.<br />
<br />
<br />
[[File:Performance Comparison of different input modalities.png]]<br />
<br />
===Qualitative Analysis===<br />
The authors have discussed the quality of the prediction in the video clips taking examples of two different scenes involving bowling and haircut. In the bowling scene, the adaptive temporal choice used by DANN could aggregate more reasonable semantic structures and hence it leveraged reasonable video clips as input. On the other hand, the performance on the haircut video clip was not up to the mark as the rich contexts provided by the DANN was not helpful in a setting with simple actions performed in a simple background.<br />
<br />
==Conclusions and Critique==<br />
* Deep alternative neural network is introduced for action recognition.<br />
* The key new component is an "alternative layer" which is composed of a convolutional layer followed by a recurrent layer. As the paper targets action recognition in video, the convolutional layer acts on a 3D spatio-temporal volume.<br />
* DANN consists of volumetric convolutional layer and a recurrent layer. <br />
* A preprocessing stage based on optical flow is used to select video fragments to feed to the neural network.<br />
* The authors have experimented with different datasets like HMDB51 and UCF101 with different scenarios and compared the * * performance of DANN with other approaches. <br />
* The spatial size is still chosen in an ad hoc manner and this can be an area of improvement. <br />
* There are prospects for studying action tube which is a more compact input.<br />
* The paper uses volumetric convolutional layer, but it doesn't say how it is better than recurrent neural networks in exploring temporal information.<br />
* There is no experimental evidence to compare the proposed method with long-term recurrent convolutional network. Also, there is no analysis of time complexity of the approach used.<br />
<br />
Github code: https://github.com/wangjinzhuo/DANN<br />
<br />
In the formal review of the paper [https://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html], some interesting criticisms of the paper are surfaced. For starters, one reviewer notes that a similar architecture was proposed in [https://arxiv.org/abs/1511.06432], limiting the novelty of the approach somewhat. The reviewers question the validity of the approach in even slightly more complicated settings (i.e. any non-static camera, which brings in the issue of optical flow). Other criticisms come from a lack of clear motivation for choices that the authors have made, for instance, the use of Local Response Normalization has fallen slightly out-of-favour, or the benefit of using a sliding window approach during testing (and random clips during training).<br />
<br />
Quantitatively, the benefits of the author's approach are not readily apparent. In comparisons with state-of-the-art, the proposed model performs worse on HMDB, and while they claim the highest performance on UCF, the increase is merely .1 over previous best efforts.<br />
<br />
==References==<br />
<br />
[1] Andrej Karpathy, George Toderici, Sachin Shetty, Tommy Leung, Rahul Sukthankar, and Li FeiFei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014 <br />
<br />
[2] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014. <br />
<br />
[3]Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deepconvolutional descriptors. In CVPR, pages 4305–4314, 2015. <br />
<br />
[4] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2013. <br />
<br />
[5] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497, 2015. <br />
<br />
[6]Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. arXiv preprint arXiv:1604.04494, 2016. <br />
<br />
[7]Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao. Towards Good Practices for Very Deep Two-Stream ConvNets. arXiv preprint arXiv:1507.02159 , 2015. <br />
<br />
[8] IEEE International Symposium on Multimedia 2013 <br />
<br />
[9] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action<br />
recognition. arXiv preprint arXiv:1604.04494, 2016<br />
<br />
[10] https://en.wikipedia.org/wiki/Optical_flow<br />
<br />
[11] Delving Deeper into Convolutional Networks for Learning Video Representations Nicolas Ballas, Li Yao, Chris Pal, Aaron Courville, ICLR 2016 <br />
<br />
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI, 37(9):1904–1916, 2015.<br />
<br />
[13] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime tv-l<br />
1 optical flow. In Pattern Recognition, pages 214–223. 2007.<br />
<br />
A list of expert reviews: http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Universal_Style_Transfer_via_Feature_Transforms&diff=31252Universal Style Transfer via Feature Transforms2017-11-23T16:53:17Z<p>Amanjhunjhunwala: /* Evaluation */</p>
<hr />
<div>=Introduction=<br />
When viewing an image, whether it is a photograph or a painting, two types of mutually exclusive data are present. First, there is the content of the image, such as a person in a portrait. However, the content does not uniquely define the image. Consider a case where multiple artists paint a portrait of an identical subject, the results would vary despite the content being invariant. The cause of the variance is rooted in the style of each particular artist. Therefore, style transfer between two images results in the content being unaffected but the style being copied. Style transfer is an important image editing task which enables the creation of new artistic works. Typically one image is termed the content/reference image, whose style is discarded. The other image is called the style image, whose style, but the not content is copied to the content image.<br />
<br />
Deep learning techniques have been shown to be effective methods for implementing style transfer. Previous methods have been successful but with several key limitations and often trade off between generalization, quality, and efficiency. Either they are fast, but have very few styles that can be transferred or they can handle arbitrary styles but are no longer efficient. The presented paper establishes a compromise between these two extremes by using only whitening and coloring transforms (WCT) to transfer a style within a feedforward image reconstruction architecture. No training of the underlying deep network is required per style.<br />
<br />
==Style Transfer==<br />
The original paper about neural style transfer suggests a novel application of convolutional filters: transfer the art style to another image. The process is described in the following figure.<br />
<br />
[[File:style_transfer.jpg|600px|center|thumb|Figure: the process of neural style transfer]]<br />
<br />
In the original architecture, the authors used VGG as the "local feature extractor", by minimizing the loss function that measures the difference between the style of the input image and the style of the target image, the network can generate an image with similar features. The key factor in the original paper is that the style similarity between the input image and target image can be measured by Gramian Matrix. The authors defined the loss function as the Gramian Matrix of the activations in different layers. Despite the amazing results, the principle of neural style transfer, especially why the Gram matrices could represent style remains unclear. In the paper[15], the authors theoretically showed that matching the Gram matrices of feature maps is equivalent to minimize the Maximum Mean Discrepancy (MMD) with the second order polynomial kernel. Thus, the authors argue that the essence of neural style transfer is to match the feature distributions between the style images and the generated images.<br />
<br />
=Related Work=<br />
Gatys et al. developed a new method for generating textures from sample images in 2015 [1] and extended their approach to style transfer by 2016 [2]. They proposed the use of a pre-trained convolutional neural network (CNN) to separate content and style of input images. Having proven successful, a number of improvements quickly developed, reducing computational time, increasing the diversity of transferrable styles, and improving the quality of the results. Central to these approaches and of the present paper is the use of a CNN. The disadvantage is the inefficiency in the optimization process. Even though there has been an improvement by formulating the stylizations, these methods require training one network per style due to the lack of generalization in network design.<br />
<br />
In 2017, Mechrez et al. [12] proposed an approach that takes as input a stylized image and makes it more photorealistic. Their approach relied on the Screened Poisson Equation, maintaining the fidelity of the stylized image while constraining the gradients to those of the original input image. The method they proposed was fast, simple, fully automatic and showed positive progress in making a stylized image photorealistic.<br />
<br />
Alternative attempts, by using a single network to transfer<br />
multiple styles include models conditioned on binary selection units [13], a network that learns a set of new filters for every new style [15], and a novel conditional normalization layer that learns normalization parameters for each style [3]<br />
<br />
In comparing their methods with the existing techniques outlined above, the authors cite the close relationship between their work and [7]. In [7] content features in higher layers are adaptively instance normalized by the mean and variance of style features. The authors consider this step to be a sub-optimal operation in the WCT. <br />
==How Content and Style are Extracted using CNNs==<br />
A CNN was chosen due to its ability to extract high level feature from images. These features can be interpreted in two ways. Within layer <math> l </math> there are <math> N_l </math> feature maps of size <math> M_l </math>. With a particular input image, the feature maps are given by <math> F_{i,j}^l </math> where <math> i </math> and <math> j </math> locate the map within the layer. Starting with a white noise image and a reference (content) image, the features can be transferred by minimizing<br />
<br />
<center><br />
<math> \mathcal{L}_{content} = \frac{1}{2} \sum_{i,j} \left( F_{i,j}^l - P_{i,j}^l \right)^2 </math><br />
</center><br />
<br />
where <math> P_{i,j} </math> denotes the feature map output caused by the white noise image. Therefore this loss function preserves the content of the reference image. The style is described using a Gram matrix given by<br />
<br />
<center><br />
<math><br />
G_{i,j}^l = \sum_k F_{i,k}^l F_{j,k}^l<br />
</math><br />
</center><br />
<br />
Gram matrix $G$ of a set of vectors $v_1,\dots,v_n$ is the matrix of all possible inner products whose entries are given by $G_{ij}=v_i^Tv_j$. The loss function that describes a difference in style between two images is equal to:<br />
<br />
<center><br />
<math><br />
\mathcal{L}_{style} = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G_{i,j}^l - A_{i,j}^l \right)^2<br />
</math><br />
</center><br />
<br />
where <math> A_{i,j}^l </math> and <math> G_{i,j}^l </math> are the Gram matrices of the generated image and style image respectively. Therefore three images are required, a style image, a content image, and an initial white noise image. Iterative optimization is then used to add content from one image to the white noise image, and style from the other. An additional parameter is used to balance the ratio of these loss functions.<br />
<br />
The 19-layer ImageNet trained VGG network was chosen by Gatys et al. VGG-19 is still commonly used in more recent works as will be shown in the presented paper, although training datasets vary. Such CNNs are typically used in classification problems by finalizing their output through a series of full connected layers. For content and style extraction it is the convolutional layers that are required. The method of Gatys et al. is style independent, since the CNN does not need to be trained for each style image. However, the process of iterative optimization to generate the output image is computationally expensive.<br />
<br />
==Other Methods==<br />
Other methods avoid the inefficiency of iterative optimization by training a network/networks on a set of styles. The network then directly transfers the style from the style image to the content image without solving the iterative optimization problem. V. Dumoulin et al. trained a single network on $N$ styles [3]. This improved upon previous work where a network was required per style [4]. The stylized output image was generated by simply running a feedforward pass of the network on the content image. While efficiency is high, the method is no longer able to apply an arbitrary style without retraining.<br />
<br />
=Methodology=<br />
Li et al. have proposed a novel method for generating the stylized image. A CNN is still used as in Gatys et al. to extract content and style. However, the stylized image is not generated through iterative optimization or a feed-forward pass as required by previous methods. Instead, whitening and colour transforms are used.<br />
<br />
==Image Reconstruction==<br />
[[File:image_resconstruction.png|thumb|150px|right|alt=Training a single decoder.|Training a single decoder. X denotes the layer of the VGG encoder that the decoder receives as input.]]<br />
An auto-encoder network is used to first encode an input image into a set of feature maps, and then decode it back to an image as shown in the adjacent figure. The encoder network used is VGG-19. This network is responsible for obtaining feature maps (similar to Gatys et al.). The output of each of the first five layers is then fed into a corresponding decoder network, which is a mirrored version of VGG-19. Each decoder network then decodes the feature maps of the $l$th layer producing an output image. A mechanism for transferring style will be implemented by manipulating the feature maps between the encoder and decoder networks.<br />
<br />
First, the auto-encoder network needs to be trained. The following loss function is used<br />
<br />
<center><br />
<math><br />
<br />
\mathcal{L} = || I_{output} - I_{input} ||_2^2 + \lambda || \Phi(I_{output}) - \Phi(I_{input})||_2^2<br />
<br />
</math><br />
</center><br />
<br />
where $I_{input}$ and $I_{output}$ are the input and output images of the auto-encoder. $\Phi$ is the VGG encoder. The first term of the loss is the pixel reconstruction loss, while the second term is feature loss. Recall from "Related Work" that the feature maps correspond to the content of the image. Therefore the second term can also be seen as penalising for content differences that arise due to the encoder network. The network was trained using the Microsoft COCO dataset. <br />
<br />
They use whitening and coloring transforms to directly transform the $f_c$ (VGG feature<br />
map of the content image at a certain layer) to $f_{cs}$ such that covariance matrix of $f_s$ (VGG feature<br />
map of style image) is same as covariance matrix of $f_{cs}$. This process consists of two steps, i.e., whitening (make covariance to identity) and coloring (make covariance to $f_s$) transforms. Note that the decoder will reconstruct the original content image if $f_c$ is directly fed into it, but if $f_{cs}$ is fed, it outputs an image with the content of content image and style of style image.<br />
<br />
==Whitening Transform==<br />
Whitening first requires that the covariance of the data is a diagonal matrix. This is done by solving for the covariance matrix's eigenvalues and eigenvector matrices. Whitening then forces the diagonal elements of the eigenvalue matrix to be the same. In other words, whitening transforms the known covariance matrix to an identity matrix such that for given feature map $f_c$, whitening transforms it into $\hat{f}_c$ such that $\hat{f}_c \times \hat{f}_c^T = I$ . This is achieved for a feature map from VGG through the following steps.<br />
<br />
# The feature map $f_c$ is extracted from a layer of the encoder network after activation on the content image. This is the data to be whitened.<br />
# $f_c$ is centered by subtracting its mean vector $m_c$.<br />
# Then, the eigenvectors $E_c$ and eigenvalues $D_c$ are found for the covariance matrix of $f_c$. <br />
# The whitened feature map is then given by $\hat{f}_c = E_c D_c^{-1/2} E_c^T f_c$. <br />
<br />
Note that this is indeed finding the symmetric transformer matrix $A$ in $\hat{f}_c = A f_c$ such that the covariance matrix of $\hat{f}_c$ is an identity matrix. If interested, the derivation of the whitening equation can be seen in [5]. Li et al. found that whitening removed styles from the image.<br />
<br />
==Colour Transform==<br />
It is the inverse of whitening transform i.e. it can transform a random variable to have the desired covariance matrix. However, whitening does not transfer style from the style image. It only uses feature maps from the content image. The colour transform uses both $\hat{f}_c$ from above and $f_s$, the feature map from the style image. Color transform in this case, transforms $\hat{f}_c$ to $f_{cs}$ such that $conv(f_{cs}) = conv(f_s)$, remember that covariance represents the ''style'' information of the image such this steps matches styles per the style image.<br />
<br />
<br />
# $f_s$ is centered by subtracting its mean vector $m_s$.<br />
# Then, the eigenvectors $E_s$ and eigenvalues $D_s$ are calculated for the covariance matrix of $f_s$.<br />
# The colour transform is given by $\hat{f}_{cs} = E_s D_s^{1/2} E_s^T \hat{f}_c$.<br />
# Recenter $\hat{f}_{cs}$ using $m_s$. i.e., $\hat{f}_{cs}$ = $\hat{f}_{cs}$ + $m_s$<br />
<br />
Intuitively, colouring results in a correlation between the $\hat{f}_c$ and $f_s$ feature maps, or rather, $\hat{f}_{cs}$ is a linear transform of the original feature map $f_c$ which takes on the variance of $f_s$. This is where the style transfer takes place.<br />
<br />
==Content/Style Balance==<br />
Using just $\hat{f}_{cs}$ as the input to the decoder may create a result that is too extreme in style. To balance content and style the new parameter $\alpha$ is defined to serve as the style weight to control the transfer effect.<br />
<br />
<center><br />
<math><br />
<br />
\hat{f}_{cs} = \alpha \hat{f}_{cs} + (1 - \alpha) f_c<br />
<br />
</math><br />
</center><br />
<br />
Authors use $\alpha$ = 0.6 in the style transfer experiments.<br />
<br />
==Using Multiple Layers==<br />
It has been previously mentioned that multiple decoders were trained, one for each of the first five layers of the encoder network. Each layer of a CNN perceives features at different levels. Levels close to the input image will detect lower level local features such as edges. Those levels deeper into the network will detect more complex global features. The style transfer algorithm is applied at each of these levels, which yields the question as to which results, as shown below, to use.<br />
<br />
[[File:multilevel_features.png|thumb|700px|center|alt=Results of style transfer from each of the first five layers of the encoder network.|Results of style transfer from each of the first five layers of the encoder network.]]<br />
<br />
Ideally, the results of each layer should be used to build the final output image. This captures the entire range of features detected by the encoder network. First, one full pass of the network is performed. Then the stylised image from the deepest layer (Relu_5_1 in this case) is taken and used as the content image for another iteration of the algorithm, where then the next layer (Relu_4_1) is used as the output. These steps are repeated until the final image is produced from the shallowest layer. This process is summarised in the figure below.<br />
<br />
[[File:process_summary.png|thumb|700px|center|alt=Process summary of the multi-level stylization algorithm.|The content (C) and style (S) are fed to the VGG encoding network. The output image (I) after a whitening and colour transform (WCT) is taken from the deepest level's decoder. The process is iteratively repeated until the most shallow layer is reached.]]<br />
<br />
The authors note that the transformations must be applied first at the highest level (most abstract) layers, which capture complicated local structures and pass this transformed image to lower layers, which improve on details. They observe that reversing this order (lowest to highest) leads to images with low visual quality, as low-level information cannot be preserved after manipulating high level features.<br />
<br />
[[File:Universal_Style_Transfer_Coarse_to_Fine.JPG|thumb|700px|center|alt=(a)-(c) Output from intermediate layers. (d) Reversed transformation order.|(a)-(c) Output from intermediate layers. (d) Reversed transformation order.]]<br />
<br />
=Evaluation=<br />
The success of style transfer might appear hard to quantify as it relies on qualitative judgement. However, the extremes of transferring no style, or transferring only the style can be considered as performing poorly. Consistent transfer of style throughout the entire image is another parameter of success. Ideally, the viewer can recognize the content of the image, while seeing it expressed in an alternative style. Quantitatively, the quality of the style transfer can be calculated by taking the covariance matrix difference $L_s$ between the resulting image and the original style. The results of the presented paper also need to be considered within the contexts of generality, efficiency and training requirements.<br />
<br />
The implementation for this paper can be found on Github at:<br />
<br />
* Torch (official) : https://github.com/Yijunmaverick/UniversalStyleTransfer<br />
* Keras : https://github.com/eridgd/WCT-TF<br />
* PyTorch : https://github.com/sunshineatnoon/PytorchWCT<br />
<br />
==Style Transfer==<br />
A number of style transfer examples are presented relative to other works. <br />
<br />
[[File:transfer_results_label.jpg|thumb|700px|center|alt=Style transfer results of the presented paper.|A: See [6]. B: See [7]. C: See [8]. D: Gatys et al. iterative optimization, see [2]. E: This paper's results.]]<br />
<br />
Li et al. then obtained the average $L_s$ using 10 random content images across 40 style images. They had the lowest average $log(L_s)$ of all referenced works at 6.3. Next lowest was Gatys et al. [2] with $log(L_s) = 6.7$. It should be noted that while $L_s$ quantitatively calculates the success of the style transfer, results are still subject to the viewer's impression. Reviewing the transfer results, rows five and six for Gatys et al.'s method shows local minimization issues. However, their method still achieves a competitive $L_s$ score.<br />
<br />
Since the qualitative assessment is highly subjective, a user study was conducted to evaluate 5 methods shown in Figure 6. The percentage of the votes each method received is shown in<br />
Table 2 (2nd row). It shows that the method presented in this paper receives the most votes for better stylized results.<br />
<center><br />
[[File:style_transfer_table_2.png]]<br />
</center><br />
<br />
==Transfer Efficiency==<br />
It was hypothesized by Li et al. that using WCT would enable faster run-times than [2] while still supporting arbitrary style transfer. For a 256x256 image, using a 12GB TITAN X, they achieved a transfer time of 1.5 seconds. Gatys et al.'s method [2] required 21.2 seconds. The pure feed-forward approaches [7], and [8] had times equal to or less than 0.2 seconds. [6] had a time comparable to the presented paper's method. However, [6,7,8] do not generalize well to multiple styles as training is required. Therefore this paper obtained a near 15x speed up for a style agnostic transfer algorithm when compared to leading previous work. The authors also note that WCT was done using the CPU. They intend to port WCT to the GPU and expect to see the computational time be further reduced.<br />
<br />
==Other Applications==<br />
Li et al.'s method can also be used for texture synthesis. This was the original work of Gatys et. al. before they applied their algorithm to style transfer problems. Texture synthesis takes a reference texture/image and creates new textures from it. With proper boundary conditions enforced these synthesized textures can be tileable. Alternatively, higher resolution textures can be generated. Texture synthesis has applications in areas such as computer graphics, allowing for large surfaces to be texture mapped.<br />
<br />
The content image is set as white noise, similar to how [2] initializes their output image. Then the reference texture/image is set as the style image. Since the content image is initially random white noise, then the features generated by the encoder of this image are also random. Li et al. state that this increases the diversity of the resulting output textures.<br />
<br />
[[File:texture_synthesis_label.jpg|thumb|700px|center|alt=Texture synthesis results.|A: Reference image/texture. B: Result from [8]. C: Result of present paper.]]<br />
<br />
Reviewing the examples from the above figure, it can be observed that the method from this paper repeats fewer local features from the image than a competing feed forward network method [8]. While the analysis is qualitative, the authors claim that their method produces "more visually pleasing results".<br />
<br />
=Conclusion=<br />
Only a couple of years ago were CNNs first used to stylize images. Today, a host of improvements have been developed, optimizing the original work of Gatys et al. for a number of different situations. Using additional training per style image, computational efficiency and image quality can be increased. However, the trained network then depends on that specific style image, or in some cases such as in [3], a set of style images. Till now, limited work has taken place in improving Gatys et al.'s method for arbitrary style images. The authors of this paper developed and evaluated a novel method for arbitrary style transfer in which they present a multi-level stylization pipeline, which takes all level of information of a style into account, for improved results. In addition, the proposed approach is shown to be equally effective for texture synthesis. Their method and Gatys et al.'s method share the use of a VGG-19 CNN as the initial processing step. However, the authors replaced iterative optimization with whitening and colour transforms, which can be applied in a single step. This yields a decrease in computational time while maintaining generality with respect to the style image. After their CNN auto-encoder is initially trained no further training is required. This allows their method to be style agnostic. Their method also performs favourably, in terms of image quality, when compared to other current work.<br />
<br />
=Critique=<br />
In the paper, the authors only experimented with layers of VGG19. Given that architectures such as ResNet and Xception perform better on image recognition tasks, it would be interesting to see how residual layers and/or Inception modules may be applied to the task of disentangling style and content and whether they would improve performance relative to the results presented in the current paper is the encoder used were to utilize layers from these alternative convolutional architectures. Additionally, it is worth exploring whether one can invent a probabilistic and/or generative version of the encoder-decoder architecture used in the paper. More precisely, is it possible to come up with something in the spirit of variational autoencoders, wherein we the bottleneck layer can be used to sample noise vectors, which can then be input into each of the decoder units to generate synthetic style and content images?<br />
Alternative attempts would also involve the study of generative adversarial networks with a perturbation threshold value. GANs can produce surreal images, where the underlying structure (content) is preserved ( in CNNs the filters learn the edges and surfaces and shape of the image), provided the Discriminator is trained for style classification ( training set consists of images pertaining the style that requires to be transferred).<br />
<br />
=Additional Results and Figures=<br />
Given in this section are the additional figures of universal style transform found in the supplementary file. They are typically for larger image sizes and more variety of styles.<br />
#[[File:style-1.PNG]]<br />
#[[File:style-2.PNG]]<br />
#[[File:style-3.PNG]]<br />
<br />
=References=<br />
[1] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, 2015.<br />
<br />
[2] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.<br />
<br />
[3] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In ICLR, 2017.<br />
<br />
[4] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016<br />
<br />
[5] R. Picard. MAS 622J/1.126J: Pattern Recognition and Analysis, Lecture 4. http://courses.media.mit.edu/2010fall/mas622j/whiten.pdf<br />
<br />
[6] T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337, 2016.<br />
<br />
[7] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. arXiv preprint arXiv:1703.06868, 2017.<br />
<br />
[8] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.<br />
<br />
[9] Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, A Neural Algorithm of Artistic Style, https://arxiv.org/abs/1508.06576<br />
<br />
[10] Karen Simonyan et al. Very Deep Convolutional Networks for Large-Scale Image Recognition<br />
<br />
[11] VGG Architectures - [http://www.robots.ox.ac.uk/~vgg/research/very_deep/| More Details]<br />
<br />
[12] Mechrez, R., Shechtman, E., & Zelnik-Manor, L. (2017). Photorealistic Style Transfer with Screened Poisson Equation. arXiv preprint arXiv:1709.09828.<br />
<br />
[13] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Diversified texture synthesis with feed-forward networks. In CVPR, 2017<br />
<br />
[14] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. Stylebank: An explicit representation for neural image style transfer. In CVPR, 2017<br />
<br />
Implementation Example: https://github.com/titu1994/Neural-Style-Transfer<br />
<br />
[15] Li, Yanghao, Naiyan Wang, Jiaying Liu and Xiaodi Hou. “Demystifying Neural Style Transfer.” IJCAI (2017).</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=%22Why_Should_I_Trust_You%3F%22:_Explaining_the_Predictions_of_Any_Classifier&diff=31251"Why Should I Trust You?": Explaining the Predictions of Any Classifier2017-11-23T16:46:17Z<p>Amanjhunjhunwala: /* Can I trust this model? */</p>
<hr />
<div>==Introduction==<br />
<br />
Understanding why machine learning models behave the way they do helps users in model selection, feature engineering, and finally trusting the model to deploy it. In many cases even though a non-interpretable model is more accurate on the validation data set than the interpretable one, an interpretable model is chosen. But restricting the models to just an interpretable model isn't the best option. In this paper, the authors argue for explaining machine learning predictions using model-agnostic approaches and propose LIME (Local Interpretable Model-Agnostic Explanations), a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. They also propose a method called sub-modular optimization to explain models globally by selecting a representative individual prediction. <br />
<br />
Some recent work aims to anticipate failures in machine learning, specifically for vision tasks. Failure modes are found by grouping incorrectly classified images according to a semantic attribute space, which are descriptions explaining the cause of the failure [12]. Another approach is to train a model that can predict failure [14]. Letting users know when the systems are likely to fail can lead to an increase in trust, by avoiding “silly mistakes” [13]. These solutions either require additional annotations and feature engineering that is specific to vision tasks or do not provide insight into why a decision should not be trusted. Furthermore, they assume that the current evaluation metrics are reliable, which may not be the case if problems such as data leakage are present. This is not the first work to look at the problems with relying on validation set accuracy as the primary measure of trust. This area has been well studied. Practitioners often overestimate their model's accuracy, propagate feedback loops, and fail to notice data leaks. Common solutions in the literature are Gestalt and Modeltracker. The LIME technique complements these tools. <br />
<br />
The experiments are conducted mainly on text data and image data, where models used are Random Forest (RF) and Convolutional Neural Networks (CNN). The experiments highlight utility of explanations in deciding between the models, which model is to be trusted and which is not and also improving the model based on the explanations. The main contributions of this paper are summarized as follows.<br />
<br />
* LIME, an algorithm to explain any prediction of any model (classifier or regressor) by approximating locally with an interpretable model that is faithful to the original model locally. <br />
<br />
* SP-LIME, a method to select a final set of representative samples using sub-modular optimization to show to the user that pretty much captures what the original model is doing globally. <br />
<br />
* A detailed set of experiments with simulated subjects to prove the usefulness of LIME and SP-LIME. <br />
<br />
[[File:LIME.jpg|thumb|550px|Figure 1: Explaining individual predictions. A model predicts that a patient has the flu, and LIME highlights the symptoms in the patient’s history that led to the prediction. Sneeze and headache are portrayed as contributing to the “flu” prediction, while “no fatigue” is evidence against it. With these, a doctor can make an informed decision about whether to trust the model’s prediction.]]<br />
<br />
== Need for Explanations ==<br />
<br />
Prediction explanation basically means answering question such as "What triggers the model to make such a prediction?", and an answer to that would be to say, these are the features with a certain range of values in your input sample that are contributing to the prediction the most. For example, it can be words in a document for text classification or patches of pixels in an image for image classification. And these explanations need to make sense to the user and hence has to be simple. Figure 1 illustrates an explanation procedure. In this case, an explanation is a small weighted list of symptoms that either contribute to the prediction (in green) or are evidence against it (in red). It is very clear that the life of a doctor is much easier in terms of making a decision with the help of a model if intelligible explanations are provided. This is similar to the topic of inference versus prediction. Some people tend to think of these two terms are the same. However, statistically, their meanings are very different. Inference means given some datasets, we would like to infer how the output/response is generated based on a sequence of explanatory variables. For instance, we would like to know/infer how education level affects people’s income level. Prediction, on the other hand, is by using a given dataset, we fit/train a model that will correctly predict the outcome of a new observation. <br />
<br />
Even with highest accuracy on validation dataset, we sometimes can't judge how the model is going to behave on other datasets. There are many reasons why this can happen. For example, Data leakage where some of the features used for training are heavily correlated with the target value in both training and validation data that might result in great train and validation results but will be of no use if used on altogether a new dataset. But if explanations such as the one in Figure 1 are given, it becomes easy to fix the fault by removing the heavily correlated feature from the data and train. This is how you convert an untrustworthy model to a trustworthy one. Another problem is dataset shift, where train and test data come from different distributions and providing explanations for predictions will help the user to take measures to make model generalize better.<br />
<br />
Figure 2 shows how explanations help in selecting between the models in addition to accuracy measures. In this case, by looking at the explanations it is easy to say that model with higher accuracy is in-fact the worst. Further, many a times, there can be difference between what we want the model to optimize and what the model is actually optimizing which might result into model not behaving as well as expected (For example in binary classification using logistic regression, all we are concerned is with accuracy, but the model is trying to reduce the log-loss due to the learning algorithms limitations). While we may not be able to quantitatively measure what difference has that made, but we still have some intuitions as to how the model has to behave with respect to some of its features, and if explanations are given for model predictions, we will be able to ascertain if those explanations make any sense. Also explaining models being model agnostic helps as we then compare different classes of models in making a choice.<br />
<br />
[[File:explanation_example.png|thumb|550px|Figure 2: Explaining individual predictions of competing classifiers trying to determine if a document is about “Christianity” or “Atheism”. The bar chart represents the importance given to the most relevant words, also highlighted in the text. Color indicates which class the word contributes to (green for “Christianity”, magenta for “Atheism”).]]<br />
<br />
===Desired Characteristics for Explainers===<br />
<br />
Explainers have to be simple models so that humans are able to comprehend what the model is doing. For example, even with linear models, if the number of features are too many, it becomes hard for humans to comprehend an overwhelming set of features. In case of models trained with word-embedding as features, we can't give these word-embeddings as explanations, instead, they need to something different than these features.<br />
<br />
Another requirement of a good explainer is that it has to be locally faithful to the original model, i.e, it must behave similarly to the model in the vicinity of the instance being predicted. And these local explanations should aid in forming global explanations.<br />
<br />
===Desired Characteristics for Explanations===<br />
<br />
'''Interpretable'''(understanding between the input and response)<br />
* Take into account user limitations.<br />
* Since features in a machine learning model need not be interpretable, the input to the explanations may have to be different from input to the model.<br />
<br />
'''Local Fidelity'''<br />
* Explanation should be locally faithful, ie it should correspond to how the model behaves in the vicinity of the instance being predicted. It doesn't mean global fidelity, where global important features may not be important in the local context.<br />
<br />
'''Model Agnostic'''<br />
* Treat the original, given model as a black box.<br />
<br />
'''Global Perspective'''(explain the model)<br />
* Select a few predictions such that they represent the entire model.<br />
<br />
==Local Interpretable Model-Agnostic Explanations (LIME)==<br />
<br />
The overall goal of LIME is to basically identify an interpretable model over the interpretable representation (features) that is locally faithful to the classifier. Here, interpretable explanations<br />
need to use a representation that is understandable to humans, regardless of the actual features used by the model. For example in case of text classification, a possible interpretable representation of data would be to use bag of words (one-hot encodings) instead of the word embeddings. Similarly, for image classification, it may be a binary vector (one-hot encoding) indicating the “presence” or “absence” of a contiguous patch of pixels (a super-pixel or a segment), while the classifier may represent the image as a tensor with three color channels per pixel. Let $x ∈ R^d$ be the original representation of an instance being explained, and $x' ∈ \{0, 1\}^{d'}$ be a binary vector for its interpretable representation.<br />
<br />
=== Fidelity-Interpretability Trade-Off ===<br />
Formally, the authors define an explanation as a model $g ∈ G$, where $G$ is a class of potentially interpretable models, such as linear models, decision trees, or falling rule lists [3]. The domain of $g$ is $\{0, 1\}^{d'}$, i.e. $g$ acts over absence/presence of the interpretable components. Let $Ω(g)$ be a measure of complexity (like a regularizer term) of the explanation $g ∈ G$. For example, for decision trees $Ω(g)$ may be the depth of the tree, for linear models, $Ω(g)$ may be the number of non-zero weights. Let the model being explained be denoted $f : R^d → R$. In classification, $f(x)$ is the probability that $x$ belongs to a certain class. We further use $π_x(z)$ as a proximity measure between an instance $z$ to $x$, so as to define locality around $x$. Finally, let $L(f, g, π_x)$ be a measure of error between $g$ and $f$ in the locality defined by $π_x$. Our task is to minimize $L(f, g, π_x)$ while having $Ω(g)$ be low enough to be interpretable by humans. The minimum value of $L(f, g, π_x)$ provides an approximate balance measure for preserving interpretability and local fidelity(it should correspond to how the model behaves in the vicinity of the instance being predicted). The explanation produced by LIME is obtained by the following:<br />
<br />
:<math>\xi(x) = \underset{g\in\mathbb{G}}{\operatorname{argmin}}\, \mathcal{L}(f,g,\pi_x) + \Omega(g)\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, Eq. (1)</math><br />
<br />
Even though $G$ can be any class of interpretable models, in this paper only sparse linear models are considered.<br />
<br />
=== Sampling for Local Exploration ===<br />
<br />
The aim is to minimize $L(f, g, π_x)$ without making any assumptions about $f$, since the explainer needs to be model-agnostic. In order to learn the local behavior of $f$ using $g$ as the interpretable inputs vary, $L(f, g, π_x)$ is approximated by drawing samples, weighted by $π_x$. We sample instances around $x'$ by setting few features of $x'$ to 0, uniformly at random (where the number of such features set to 0 is also uniformly sampled). Given a perturbed sample $z' ∈ \{0, 1\}^{d'}$ (which contains a fraction of the features of $x'$ set to 0), we recover the sample in the original representation $z ∈ R^d$ and obtain the prediction $f(z)$ from the original classifier, and is used as a label for the explanation model. Given this dataset $Z$ of perturbed samples with the associated labels, we optimize Eq. (1) to get an explanation $ξ(x)$. The primary intuition behind LIME is presented in Figure 3, where we sample instances both in the vicinity of $x$ (which have a high weight due to $π_x$) and far away from $x$ (low weight from $π_x$). Even though the original model may be too complex to explain globally, LIME presents an explanation that is locally faithful (linear in this case), where the locality is captured by a weight factor, $π_x$. A concrete example of this process is presented in the next section.<br />
<br />
[[File:decision_boundary.png|thumb|Figure 3: Toy example to present intuition for LIME. The black-box model’s complex decision function $f$ (unknown to LIME) is represented by the blue/pink background, which cannot be approximated well by a linear model. The bold red cross is the instance being explained. LIME samples instances, get predictions using $f$, and weighs them by the proximity to the instance being explained (represented here by size). The dashed line is the learned explanation that is locally (but not globally) faithful.]]<br />
<br />
===Sparse Linear Explanations===<br />
<br />
For the rest of this paper, we let $G$ be the class of linear models, such that $g(z') = w_g · z'$ . We use the locally weighted square loss as $L$, as defined in Eq. (2), where we let $π_x(z) = exp(−D(x, z)^2 /σ^2 )$ be an exponential kernel defined on some distance function $D$ (e.g. cosine distance for text, L2 distance for images) with width $σ$ resulting in higher sample weights for perturbed samples that are in the vicinity of sample being explained and lower weights for samples that are far away.<br />
<br />
:<math>\mathcal{L}(f,g,\pi_x) = \underset{z,z'\in\mathbb{Z}}{\operatorname{\Sigma}} \pi_x(z)\,(f(z) - g(z'))^2\,\,\,\,\,\,\,\,\,\,\,\, Eq.(2)</math><br />
<br />
For text classification, the features used for explanation models are bag of words, and we set a limit on the number of words used for explanation, $K$. i.e. $Ω(g) = ∞1[||w_g||_0 > K]$. We use the same $Ω$ for image classification, using “super-pixels” (computed using any standard image segmentation algorithm, eg. quickshift() function in skimage python library [5]) instead of words, such that the interpretable representation of an image is a binary vector where 1 indicates the original super-pixel and 0 indicates a grayed out super-pixel. This particular choice of $Ω$ makes directly solving Eq. (1) intractable, but we approximate it by first selecting $K$ features with Lasso (using the regularization path [4]) (scikit lars_path method [6]) and then learning the weights via least squares (together a procedure we call K-LASSO in Algorithm 1). . <br />
<br />
There are some limitations to class of locally interpretable models $G$. It can happen that no $g \in G$ is powerful enough to be locally faithful to the original model, i.e, $g$ has a high bias w.r.t $f$ locally. This may be because the underlying model is quite complex even locally. Another issue could be, the features used for building explanations are not representative of the inputs or factors that the underlying model relies on. For example, a model that predicts sepia-toned images to be retro cannot be explained by the presence or absence of super pixels.<br />
<br />
<br />
[[File:algorithm1.png|thumb|550px]]<br />
<br />
==Examples==<br />
<br />
===Example 1: Text classification with SVMs===<br />
<br />
In Figure 2 (right side), LIME explains an SVM classifier with RBF kernel trained on "20 newsgroup dataset" to classify documents into "Atheism" and "Christianity". Even though it achieves 94% accuracy on validation dataset, the explanation for an instance shows that features that are used the most in predictions are very arbitrary such as words like "Posting", "Host" and "Re", which have no connection to either Christianity or Atheism. Even if these stop words or headers are removed, the classifier still considers the proper names of people writing the post to be important features, which clearly doesn't make sense. Hence we can conclude by looking at the explanations that the dataset has serious issues and when a classifier is trained on it, it won't be able to generalize well. Hence suitable steps needed to be taken to train a trustworthy classifier.<br />
<br />
===Example 2: Deep networks for images===<br />
<br />
In this example, we use sparse linear explanations to explain Google’s pre-trained Inception neural network [7] which is an image classifier. Here features used for giving explanations are super-pixels (segments) which intuitively makes sense as we humans detect an object in an image based on certain parts of the object. Figure 4a show an arbitrary image that we want to explain. Figures 4b, 4c, 4d show the super-pixels that contribute the most while predicting the image to be one of the top 3 classes. Here we set max feature count for explainer, $K = 10$. It is quite intuitive from Figure 4b in particular that why acoustic guitar was predicted to be an electric based on the fretboard that resembles an acoustic guitar. So even though the prediction is incorrect, it still is not unreasonable at all.<br />
<br />
[[File:inception_example.png|thumb|center|550px]]<br />
<br />
==Submodular Pick for Explaining Models==<br />
<br />
Submodular pick provides some sort of a global understanding of the model by methodically picking a set of non-redundant sample explanations. This method is still model agnostic and complimentary to calculating validation accuracy in machine learning problem. If we choose a large number of instances to explain, the user may not be able to go through each of them and verify what is wrong or right with the model. We say humans have a budget $B$, that is the number of instance explanations that they are willing to examine. So given a set of $X$ instances, the task of sub-modular optimization is to pick a maximum of $B$ instances that effectively capture the essence of what the model is doing in a non-redundant way.<br />
<br />
Here we have a explanations for $n$ instances and we construct an $n × d'$ matrix $W$, each row of which represent local importances of features used for each instance explanations. In case of using sparse linear models as explanations, for an instance $x_i$ and explanation $g_i = ξ(x_i)$, we set $W_{ij} = |w_{gij}|$. Further, for each feature (column) $j$ in $W$, let $I_j$ denote the global importance of that component in the explanation space. Intuitively, $I$ represents global importance of each feature. In Figure 5, we show a toy example $W$, with $n = d' = 5$, where $W$ is binary (for simplicity). Since feature $f2$ is used in most of the explanations, $I$ should have higher value for feature $f2$ than $f1$, i.e. $I_2 > I_1$. Formally for the text classifiers, we set $I_j = \sqrt{\sum_{i=1}^nW_{ij}}$ . For images, $I$ must represent something that can be compared across the super-pixels (segments) in different images, such as color histograms or other features of super-pixels; But this is left as a future work for now.<br />
<br />
While picking the samples, we must be careful in not picking up samples that explain the same thing. So we want a minimum set of samples that cover the maximum number of features as possible. In Figure 5, after the second row is picked, the third row is redundant, as the user has already seen features $f2$ and $f3$ - while the last row gives the user completely new features. Hence selecting the second and last row results in giving the maximum coverage of features. Eq. (3) formalizes this process, where we define coverage as the set function $c$ that, given $W$ and $I$, computes the total importance of the features that at least appear in one instance in a set $V$ .<br />
<br />
:<math>c(V, W, I) = {{\sum_{j = 1}^{d'}}}\mathbb{1}_{[\exists i\in V:W_{ij}\gt 0]}I_j\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, Eq. (3)</math><br />
<br />
The pick problem, defined in Eq. (4), consists of finding the set $V, |V| ≤ B$ that achieves the highest coverage.<br />
<br />
:<math>Pick(W, I) = \underset{V,|V|\leq B}{\operatorname{argmax}}\, c(V,W, I)\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, Eq. (4)</math><br />
<br />
The problem in Eq. (4) is maximizing a weighted coverage function, and is NP-hard [10]. Let $c(V\cup\{i\}, W, I)−c(V, W, I)$ be the marginal coverage gain of adding an instance $i$ to a set $V$ . Due to sub-modularity, a greedy algorithm that iteratively adds the instance with the highest marginal coverage gain to the solution offers a constant-factor approximation guarantee of $1−1/e$ to the optimum [8]. This approximation process is outlined in Algorithm 2 and is called sub-modular pick.<br />
<br />
[[File:algorithm2.png|thumb|550px]] <br />
[[File:toy_example.png|thumb|Figure 5: Toy example W. Rows represent instances (documents) and columns represent features (words). Feature f2 (dotted blue) has the highest importance. Rows 2 and 5 (in red) would be selected by the pick procedure, covering all but feature f1.]]<br />
<br />
==Simulated User Experiments==<br />
<br />
This section highlights simulated user experiments conducted to evaluate how explanations help the user in different tasks. In particular, following questions are addressed: (1) Are the explanations faithful to the model, that is are the explainers correctly highlighting the important features that the model is selecting at least locally? (2) Can the explanations help users in ascertaining trust in predictions; that is to say, if the predictions make sense or not? and (3) Are the explanations useful for evaluating the model as a whole and thereby aid in finally selecting a model? Code and data for experiments are available at https://github.com/marcotcr/lime-experiments.<br />
<br />
===Experiment Setup===<br />
<br />
2 sentiment analysis datasets (books and DVDs, 2000 instances each) where the task is to classify product reviews as positive or negative [9] are used in the experiment. Decision trees (DT), logistic regression with L2 regularization (LR), nearest neighbors (NN), and SVMs with RBF kernel, all using bag of words as features are trained separately. There is also a random forest (with 1000 trees) trained over average word2vec embeddings (RF). Unless otherwise noted, default methods of scikit are used. Dataset is divided into train and test in the ratio of 4:1. To explain individual predictions, LIME is compared with parzen [10], a method that approximates the black box classifier globally with Parzen windows, and explains individual predictions by taking the gradient of the prediction probability function. (Parzen-window density estimation is essentially a data-interpolation technique. Given an instance of the random sample, ${\bf x}$, Parzen-windowing estimates the PDF $P(X)$ from which the sample was derived. It essentially superposes kernel functions placed at each observation or datum. In this way, each observation $x_i$ contributes to the PDF estimate [15]). For parzen, $K$ features are selected as explanations such that they have the highest absolute gradients. Hyper-parameters for LIME and parzen are set using cross-validations and number of perturbed samples for a single explanation, $N$ is set to 15000. There is also a comparison made against a greedy procedure in which features that contribute the most to the predicted class are greedily removed until prediction changes or we reach a maximum of $K$ features that are removed, and these $K$ or fewer features are finally selected to explain the model. Also a random procedure that randomly picks K features as an explanation is also used for comparison. For all the experiments below $K$ is set to 10. For experiments where the pick procedure applies, it's either a random pick (RP) or sub-modular pick (SP). We refer to pick-explainer combinations by adding RP or SP as a prefix.<br />
<br />
[[File:recall1.png|thumb|Figure 6: Recall [11] on truly important features for two interpretable classifiers on the books dataset.]] [[File:recall2.png|thumb|Figure 7: Recall on truly important features for two interpretable classifiers on the DVDs dataset.]]<br />
<br />
===Are explanations faithful to the model?===<br />
<br />
Here we evaluate faithfulness of explanations on classifiers that are by themselves explicable such as sparse logistic regression or decision trees. So we train 2 models, a sparse logistic regressor and a decision tree such that the maximum number of features they use 10 so we now know what the important features (gold features) really are for these 2 models. For each prediction on the test set, we generate explanations and compute the fraction of these gold features that are recovered by the explanations (recall). We report this recall averaged over all the test instances in Figures 6 and 7. From the bar graphs we can see that the greedy approach is comparable to parzen on logistic regression, but is substantially worse on decision trees since changing a single feature at a time often does not have an effect on the prediction. The overall recall by parzen is low, likely due to the difficulty in approximating the original high-dimensional classifier. LIME consistently provides > 90% recall for both classifiers on both datasets, and this demonstrates that LIME explanations are more faithful to the model than any other explanations.<br />
<br />
===Should I trust this prediction?===<br />
<br />
In order to simulate trust in individual predictions, we first randomly select 25% of the features to be “untrustworthy”, and assume that the users can identify and would not want to trust these features (such as the headers in 20 newsgroups, leaked data, etc). We thus develop oracle “trustworthiness” by labeling test set predictions from a black box classifier as “untrustworthy” if the prediction changes when untrustworthy features are removed from the instance, and “trustworthy” otherwise. In order to simulate users, we assume that users deem predictions untrustworthy from LIME and parzen explanations if the prediction from the linear approximation changes when all untrustworthy features that appear in the explanations are removed (the simulated human “discounts” the effect of untrustworthy features). For greedy and random, the prediction is mistrusted if any untrustworthy features are present in the explanation since these methods do not provide a notion of the contribution of each feature to the prediction. Thus for each test set prediction, we can evaluate whether the simulated user trusts it using each explanation method, and compare it to the trustworthiness oracle. <br />
<br />
Using this setup, we report the score, F1 (average of precision and recall) on the trustworthy predictions for each explanation method, averaged over 100 runs, in Table 1. The results indicate that LIME dominates others (all results are significant at p = 0.01) on both datasets, and for all of the black box models. The other methods either achieve a lower recall (i.e. they mistrust predictions more than they should) or lower precision (i.e. they trust too many predictions), while LIME maintains both high precision and high recall. Even though we artificially select which features are untrustworthy, these results indicate that LIME is helpful in assessing trust in individual predictions.<br />
<br />
[[File:table1.png|thumb|Table 1: Average F1 of trustworthiness for different explainers on a collection of classifiers and datasets.]][[File:choose_classifier.png|thumb|Figure 8: Choosing between two classifiers, as the number of instances shown to a simulated user is varied. Averages and standard errors from 800 runs.]]<br />
<br />
===Can I trust this model?===<br />
<br />
In the final simulated user experiment, we consider a case where the user has to pick between 2 competing models with similar accuracy on validation dataset. For this purpose, we add 10 artificially “noisy” features. Specifically, on training and validation sets (80/20 split of the original training data), each artificial feature appears in 10% of the examples in one class, and 20% of the other, while on the test instances, each artificial feature appears in 10% of the examples in each class. This recreates the situation where the models use not only features that are informative in the real world, but also ones that introduce spurious correlations. We create pairs of competing classifiers by repeatedly training pairs of random forests with 30 trees until their validation accuracy is within 0.1% of each other, but their test accuracy differs by at least 5%. Thus, it is not possible to identify the better classifier (the one with higher test accuracy) from the accuracy on the validation data. <br />
<br />
The goal of this experiment is to evaluate whether a user can identify the better classifier based on the explanations of $B$ instances from the validation set. The simulated human marks the set of artificial features that appear in the $B$ explanations as untrustworthy, following which we evaluate how many total predictions in the validation set should be trusted (as in the previous section, treating only marked features as untrustworthy). Then, we select the classifier with fewer untrustworthy predictions and compare this choice to the classifier with higher held-out test set accuracy. <br />
<br />
We present the accuracy of picking the correct classifier as $B$ varies, averaged over 800 runs, in Figure 8. We omit SP-parzen and RP-parzen from the figure since they did not produce useful explanations, performing only slightly better than random. LIME is consistently better than greedy, irrespective of the pick method. Further, combining sub-modular pick with LIME outperforms all other methods, in particular, it is much better than RP-LIME when only a few examples are shown to the users. These results demonstrate that the trust assessments provided by SP-selected LIME explanations are good indicators of generalization.<br />
<br />
==Conclusion and Future Work==<br />
<br />
The paper evaluates its approach on a series of simulated and human-in-the-loop tasks to check:<br />
* The predictions of any classifier in an interpretable and faithful manner.<br />
* The method to explain its models by obtaining individual predictions and their explanations.<br />
* Could the predictions be trusted.<br />
* Can the model be trusted.<br />
* Can users select the best classifier given the explanations.<br />
* Can user (non-experts) improve the classifier by means of feature selection.<br />
* Can explanations lead to insights about the model itself.<br />
<br />
This paper successfully argues for the importance of explaining the predictions to make the model more trustworthy to the user. It proposes LIME, a comprehensive approach to faithfully explain predictions of any model in the simplest way desired. Detailed experiments demonstrating how explanations helped users pick models or discard models are provided. As a future consideration, authors want to repeat the experiments with decision trees being the interpretable model instead of sparse linear models. Authors also want to come up with an approach to perform the pick step for images. They also want to look into an automated way of selecting hyper-parameters used in LIME and SP-LIME such as K, N etc. and finally consider running the code on GPU to create explanations in real-time.<br />
<br />
==Remarks and Critique==<br />
<br />
This paper is by far the best paper I have read on model interpretability. The paper is full of new ideas and the authors clearly explain the approaches used with reasons as to why they picked it and also highlight both weaknesses and strong points of the approaches. Even though some of the tricks used are a little ambiguous on paper such as K-Lasso, super-pixels, but going through the code on GitHub, it becomes more clear. The code is also very well documented and easy to install and run to reproduce the results published. I tried running it on CIFAR-10 dataset and found it to be useful in understanding my model better.<br />
<br />
There are few minor details that are unanswered in the paper.<br />
<br />
* To interpret prediction for an image classifier, authors use super-pixels (segments) as features, but they don't give any reason as to why they picked this approach. One of the reasons I can think of is that convolutional neural networks learn shapes from layer to layer, which becomes more and more complete as we move from top to lower layers. So in a way you can say, the classifier classifies on the basis of set of groups of pixels close to each other (as in the case of segments) than just the pixels that are spread apart far away from each other.<br />
<br />
* The authors don't say how locally faithful the interpretable model should be to the classifier with constraints on the maximum number of features to be used, i.e, what is the sort of error (mean squared error in case of linear interpretable models) we are content with?<br />
<br />
* It seems that LIME can easily extended to regression cases and is not just limited to classification tasks, but the paper doesn't discuss anything in this regard.<br />
<br />
This paper is also comparable to the DeepLIFT[17] method, which provides a global numerical recovery of sample manifold. Both methods rely heavily on the reference point chosen. LIME has the advantage of matching the behavior of the model with interpretable explanations, while DeepLIFT focuses on the model structure and yields overall assessment of the model by manually feeding several typical reference inputs into the method.<br />
<br />
In this paper, the experiments with Deep Convolutional Neural Network try to find the pixels which contribute most to the final output. As a complement reference, this paper[16] also focuses on the interpretability of CNN models by sharing the same idea. In that work, the authors use Global Averaging Pooling layer after the last convolutional layer. Then a heat map is generated by a weighted sum of the outputs from last convolutional layer. By plugging back to the input image, the area that contributes most to the output is highlighted. <br />
<br />
<gallery widths=300px heights=300px><br />
File:CAM.png|thumb|left|550px|Figure 9: example output from CAM<br />
File:cam2.jpg|thumb|right|695px|Figure 10: example output from CAM<br />
</gallery><br />
<br />
As a result, we can actually tell what is the model learning. Examples are shown in Figure 9, 10.<br />
<br />
==References==<br />
[1] Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin, Model-Agnostic Interpretability of Machine Learning, presented at 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), New York, NY<br />
<br />
[2] Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin, “Why Should I Trust You?” Explaining the Predictions of Any Classifier, KDD 2016 San Francisco, CA, USA<br />
<br />
[3] F. Wang and C. Rudin. Falling rule lists. In Artificial Intelligence and Statistics (AISTATS), 2015.<br />
<br />
[4] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.<br />
<br />
[5] http://scikit-image.org/docs/dev/api/skimage.segmentation.html#skimage.segmentation.quickshift<br />
<br />
[6] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.lars_path.html<br />
<br />
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.<br />
<br />
[8] A. Krause and D. Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, February 2014.<br />
<br />
[9] J. Blitzer, M. Dredze, and F. Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Association for Computational Linguistics (ACL), 2007.<br />
<br />
[10] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Muller. How to explain individual classification decisions. Journal of Machine Learning Research, 11, 2010.<br />
<br />
[11] https://en.wikipedia.org/wiki/Precision_and_recall<br />
<br />
[12] A. Bansal, A. Farhadi, and D. Parikh. Towards transparent systems: Semantic characterization of failure modes. In European Conference on Computer Vision (ECCV), 2014.<br />
<br />
[13] M. T. Dzindolet, S. A. Peterson, R. A. Pomranky, L. G. Pierce, and H. P. Beck. The role of trust in automation reliance. Int. J. Hum.-Comput. Stud., 58(6), 2003.<br />
<br />
[14] P. Zhang, J. Wang, A. Farhadi, M. Hebert, and D. Parikh. Predicting failures of vision systems. In Computer Vision and Pattern Recognition (CVPR), 2014.<br />
<br />
[15] Parzen Density Windows :- https://www.cs.utah.edu/~suyash/Dissertation_html/node11.html<br />
<br />
[16] Zhou, Bolei et al. “Learning Deep Features for Discriminative Localization.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016): 2921-2929.<br />
<br />
[17] Avanti Shrikumar, Peyton Greenside, Anshul Kundaje: Learning Important Features Through Propagating Activation Differences ([https://arxiv.org/abs/1704.02685 link])</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=%22Why_Should_I_Trust_You%3F%22:_Explaining_the_Predictions_of_Any_Classifier&diff=31250"Why Should I Trust You?": Explaining the Predictions of Any Classifier2017-11-23T16:46:05Z<p>Amanjhunjhunwala: /* Conclusion and Future Work */</p>
<hr />
<div>==Introduction==<br />
<br />
Understanding why machine learning models behave the way they do helps users in model selection, feature engineering, and finally trusting the model to deploy it. In many cases even though a non-interpretable model is more accurate on the validation data set than the interpretable one, an interpretable model is chosen. But restricting the models to just an interpretable model isn't the best option. In this paper, the authors argue for explaining machine learning predictions using model-agnostic approaches and propose LIME (Local Interpretable Model-Agnostic Explanations), a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. They also propose a method called sub-modular optimization to explain models globally by selecting a representative individual prediction. <br />
<br />
Some recent work aims to anticipate failures in machine learning, specifically for vision tasks. Failure modes are found by grouping incorrectly classified images according to a semantic attribute space, which are descriptions explaining the cause of the failure [12]. Another approach is to train a model that can predict failure [14]. Letting users know when the systems are likely to fail can lead to an increase in trust, by avoiding “silly mistakes” [13]. These solutions either require additional annotations and feature engineering that is specific to vision tasks or do not provide insight into why a decision should not be trusted. Furthermore, they assume that the current evaluation metrics are reliable, which may not be the case if problems such as data leakage are present. This is not the first work to look at the problems with relying on validation set accuracy as the primary measure of trust. This area has been well studied. Practitioners often overestimate their model's accuracy, propagate feedback loops, and fail to notice data leaks. Common solutions in the literature are Gestalt and Modeltracker. The LIME technique complements these tools. <br />
<br />
The experiments are conducted mainly on text data and image data, where models used are Random Forest (RF) and Convolutional Neural Networks (CNN). The experiments highlight utility of explanations in deciding between the models, which model is to be trusted and which is not and also improving the model based on the explanations. The main contributions of this paper are summarized as follows.<br />
<br />
* LIME, an algorithm to explain any prediction of any model (classifier or regressor) by approximating locally with an interpretable model that is faithful to the original model locally. <br />
<br />
* SP-LIME, a method to select a final set of representative samples using sub-modular optimization to show to the user that pretty much captures what the original model is doing globally. <br />
<br />
* A detailed set of experiments with simulated subjects to prove the usefulness of LIME and SP-LIME. <br />
<br />
[[File:LIME.jpg|thumb|550px|Figure 1: Explaining individual predictions. A model predicts that a patient has the flu, and LIME highlights the symptoms in the patient’s history that led to the prediction. Sneeze and headache are portrayed as contributing to the “flu” prediction, while “no fatigue” is evidence against it. With these, a doctor can make an informed decision about whether to trust the model’s prediction.]]<br />
<br />
== Need for Explanations ==<br />
<br />
Prediction explanation basically means answering question such as "What triggers the model to make such a prediction?", and an answer to that would be to say, these are the features with a certain range of values in your input sample that are contributing to the prediction the most. For example, it can be words in a document for text classification or patches of pixels in an image for image classification. And these explanations need to make sense to the user and hence has to be simple. Figure 1 illustrates an explanation procedure. In this case, an explanation is a small weighted list of symptoms that either contribute to the prediction (in green) or are evidence against it (in red). It is very clear that the life of a doctor is much easier in terms of making a decision with the help of a model if intelligible explanations are provided. This is similar to the topic of inference versus prediction. Some people tend to think of these two terms are the same. However, statistically, their meanings are very different. Inference means given some datasets, we would like to infer how the output/response is generated based on a sequence of explanatory variables. For instance, we would like to know/infer how education level affects people’s income level. Prediction, on the other hand, is by using a given dataset, we fit/train a model that will correctly predict the outcome of a new observation. <br />
<br />
Even with highest accuracy on validation dataset, we sometimes can't judge how the model is going to behave on other datasets. There are many reasons why this can happen. For example, Data leakage where some of the features used for training are heavily correlated with the target value in both training and validation data that might result in great train and validation results but will be of no use if used on altogether a new dataset. But if explanations such as the one in Figure 1 are given, it becomes easy to fix the fault by removing the heavily correlated feature from the data and train. This is how you convert an untrustworthy model to a trustworthy one. Another problem is dataset shift, where train and test data come from different distributions and providing explanations for predictions will help the user to take measures to make model generalize better.<br />
<br />
Figure 2 shows how explanations help in selecting between the models in addition to accuracy measures. In this case, by looking at the explanations it is easy to say that model with higher accuracy is in-fact the worst. Further, many a times, there can be difference between what we want the model to optimize and what the model is actually optimizing which might result into model not behaving as well as expected (For example in binary classification using logistic regression, all we are concerned is with accuracy, but the model is trying to reduce the log-loss due to the learning algorithms limitations). While we may not be able to quantitatively measure what difference has that made, but we still have some intuitions as to how the model has to behave with respect to some of its features, and if explanations are given for model predictions, we will be able to ascertain if those explanations make any sense. Also explaining models being model agnostic helps as we then compare different classes of models in making a choice.<br />
<br />
[[File:explanation_example.png|thumb|550px|Figure 2: Explaining individual predictions of competing classifiers trying to determine if a document is about “Christianity” or “Atheism”. The bar chart represents the importance given to the most relevant words, also highlighted in the text. Color indicates which class the word contributes to (green for “Christianity”, magenta for “Atheism”).]]<br />
<br />
===Desired Characteristics for Explainers===<br />
<br />
Explainers have to be simple models so that humans are able to comprehend what the model is doing. For example, even with linear models, if the number of features are too many, it becomes hard for humans to comprehend an overwhelming set of features. In case of models trained with word-embedding as features, we can't give these word-embeddings as explanations, instead, they need to something different than these features.<br />
<br />
Another requirement of a good explainer is that it has to be locally faithful to the original model, i.e, it must behave similarly to the model in the vicinity of the instance being predicted. And these local explanations should aid in forming global explanations.<br />
<br />
===Desired Characteristics for Explanations===<br />
<br />
'''Interpretable'''(understanding between the input and response)<br />
* Take into account user limitations.<br />
* Since features in a machine learning model need not be interpretable, the input to the explanations may have to be different from input to the model.<br />
<br />
'''Local Fidelity'''<br />
* Explanation should be locally faithful, ie it should correspond to how the model behaves in the vicinity of the instance being predicted. It doesn't mean global fidelity, where global important features may not be important in the local context.<br />
<br />
'''Model Agnostic'''<br />
* Treat the original, given model as a black box.<br />
<br />
'''Global Perspective'''(explain the model)<br />
* Select a few predictions such that they represent the entire model.<br />
<br />
==Local Interpretable Model-Agnostic Explanations (LIME)==<br />
<br />
The overall goal of LIME is to basically identify an interpretable model over the interpretable representation (features) that is locally faithful to the classifier. Here, interpretable explanations<br />
need to use a representation that is understandable to humans, regardless of the actual features used by the model. For example in case of text classification, a possible interpretable representation of data would be to use bag of words (one-hot encodings) instead of the word embeddings. Similarly, for image classification, it may be a binary vector (one-hot encoding) indicating the “presence” or “absence” of a contiguous patch of pixels (a super-pixel or a segment), while the classifier may represent the image as a tensor with three color channels per pixel. Let $x ∈ R^d$ be the original representation of an instance being explained, and $x' ∈ \{0, 1\}^{d'}$ be a binary vector for its interpretable representation.<br />
<br />
=== Fidelity-Interpretability Trade-Off ===<br />
Formally, the authors define an explanation as a model $g ∈ G$, where $G$ is a class of potentially interpretable models, such as linear models, decision trees, or falling rule lists [3]. The domain of $g$ is $\{0, 1\}^{d'}$, i.e. $g$ acts over absence/presence of the interpretable components. Let $Ω(g)$ be a measure of complexity (like a regularizer term) of the explanation $g ∈ G$. For example, for decision trees $Ω(g)$ may be the depth of the tree, for linear models, $Ω(g)$ may be the number of non-zero weights. Let the model being explained be denoted $f : R^d → R$. In classification, $f(x)$ is the probability that $x$ belongs to a certain class. We further use $π_x(z)$ as a proximity measure between an instance $z$ to $x$, so as to define locality around $x$. Finally, let $L(f, g, π_x)$ be a measure of error between $g$ and $f$ in the locality defined by $π_x$. Our task is to minimize $L(f, g, π_x)$ while having $Ω(g)$ be low enough to be interpretable by humans. The minimum value of $L(f, g, π_x)$ provides an approximate balance measure for preserving interpretability and local fidelity(it should correspond to how the model behaves in the vicinity of the instance being predicted). The explanation produced by LIME is obtained by the following:<br />
<br />
:<math>\xi(x) = \underset{g\in\mathbb{G}}{\operatorname{argmin}}\, \mathcal{L}(f,g,\pi_x) + \Omega(g)\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, Eq. (1)</math><br />
<br />
Even though $G$ can be any class of interpretable models, in this paper only sparse linear models are considered.<br />
<br />
=== Sampling for Local Exploration ===<br />
<br />
The aim is to minimize $L(f, g, π_x)$ without making any assumptions about $f$, since the explainer needs to be model-agnostic. In order to learn the local behavior of $f$ using $g$ as the interpretable inputs vary, $L(f, g, π_x)$ is approximated by drawing samples, weighted by $π_x$. We sample instances around $x'$ by setting few features of $x'$ to 0, uniformly at random (where the number of such features set to 0 is also uniformly sampled). Given a perturbed sample $z' ∈ \{0, 1\}^{d'}$ (which contains a fraction of the features of $x'$ set to 0), we recover the sample in the original representation $z ∈ R^d$ and obtain the prediction $f(z)$ from the original classifier, and is used as a label for the explanation model. Given this dataset $Z$ of perturbed samples with the associated labels, we optimize Eq. (1) to get an explanation $ξ(x)$. The primary intuition behind LIME is presented in Figure 3, where we sample instances both in the vicinity of $x$ (which have a high weight due to $π_x$) and far away from $x$ (low weight from $π_x$). Even though the original model may be too complex to explain globally, LIME presents an explanation that is locally faithful (linear in this case), where the locality is captured by a weight factor, $π_x$. A concrete example of this process is presented in the next section.<br />
<br />
[[File:decision_boundary.png|thumb|Figure 3: Toy example to present intuition for LIME. The black-box model’s complex decision function $f$ (unknown to LIME) is represented by the blue/pink background, which cannot be approximated well by a linear model. The bold red cross is the instance being explained. LIME samples instances, get predictions using $f$, and weighs them by the proximity to the instance being explained (represented here by size). The dashed line is the learned explanation that is locally (but not globally) faithful.]]<br />
<br />
===Sparse Linear Explanations===<br />
<br />
For the rest of this paper, we let $G$ be the class of linear models, such that $g(z') = w_g · z'$ . We use the locally weighted square loss as $L$, as defined in Eq. (2), where we let $π_x(z) = exp(−D(x, z)^2 /σ^2 )$ be an exponential kernel defined on some distance function $D$ (e.g. cosine distance for text, L2 distance for images) with width $σ$ resulting in higher sample weights for perturbed samples that are in the vicinity of sample being explained and lower weights for samples that are far away.<br />
<br />
:<math>\mathcal{L}(f,g,\pi_x) = \underset{z,z'\in\mathbb{Z}}{\operatorname{\Sigma}} \pi_x(z)\,(f(z) - g(z'))^2\,\,\,\,\,\,\,\,\,\,\,\, Eq.(2)</math><br />
<br />
For text classification, the features used for explanation models are bag of words, and we set a limit on the number of words used for explanation, $K$. i.e. $Ω(g) = ∞1[||w_g||_0 > K]$. We use the same $Ω$ for image classification, using “super-pixels” (computed using any standard image segmentation algorithm, eg. quickshift() function in skimage python library [5]) instead of words, such that the interpretable representation of an image is a binary vector where 1 indicates the original super-pixel and 0 indicates a grayed out super-pixel. This particular choice of $Ω$ makes directly solving Eq. (1) intractable, but we approximate it by first selecting $K$ features with Lasso (using the regularization path [4]) (scikit lars_path method [6]) and then learning the weights via least squares (together a procedure we call K-LASSO in Algorithm 1). . <br />
<br />
There are some limitations to class of locally interpretable models $G$. It can happen that no $g \in G$ is powerful enough to be locally faithful to the original model, i.e, $g$ has a high bias w.r.t $f$ locally. This may be because the underlying model is quite complex even locally. Another issue could be, the features used for building explanations are not representative of the inputs or factors that the underlying model relies on. For example, a model that predicts sepia-toned images to be retro cannot be explained by the presence or absence of super pixels.<br />
<br />
<br />
[[File:algorithm1.png|thumb|550px]]<br />
<br />
==Examples==<br />
<br />
===Example 1: Text classification with SVMs===<br />
<br />
In Figure 2 (right side), LIME explains an SVM classifier with RBF kernel trained on "20 newsgroup dataset" to classify documents into "Atheism" and "Christianity". Even though it achieves 94% accuracy on validation dataset, the explanation for an instance shows that features that are used the most in predictions are very arbitrary such as words like "Posting", "Host" and "Re", which have no connection to either Christianity or Atheism. Even if these stop words or headers are removed, the classifier still considers the proper names of people writing the post to be important features, which clearly doesn't make sense. Hence we can conclude by looking at the explanations that the dataset has serious issues and when a classifier is trained on it, it won't be able to generalize well. Hence suitable steps needed to be taken to train a trustworthy classifier.<br />
<br />
===Example 2: Deep networks for images===<br />
<br />
In this example, we use sparse linear explanations to explain Google’s pre-trained Inception neural network [7] which is an image classifier. Here features used for giving explanations are super-pixels (segments) which intuitively makes sense as we humans detect an object in an image based on certain parts of the object. Figure 4a show an arbitrary image that we want to explain. Figures 4b, 4c, 4d show the super-pixels that contribute the most while predicting the image to be one of the top 3 classes. Here we set max feature count for explainer, $K = 10$. It is quite intuitive from Figure 4b in particular that why acoustic guitar was predicted to be an electric based on the fretboard that resembles an acoustic guitar. So even though the prediction is incorrect, it still is not unreasonable at all.<br />
<br />
[[File:inception_example.png|thumb|center|550px]]<br />
<br />
==Submodular Pick for Explaining Models==<br />
<br />
Submodular pick provides some sort of a global understanding of the model by methodically picking a set of non-redundant sample explanations. This method is still model agnostic and complimentary to calculating validation accuracy in machine learning problem. If we choose a large number of instances to explain, the user may not be able to go through each of them and verify what is wrong or right with the model. We say humans have a budget $B$, that is the number of instance explanations that they are willing to examine. So given a set of $X$ instances, the task of sub-modular optimization is to pick a maximum of $B$ instances that effectively capture the essence of what the model is doing in a non-redundant way.<br />
<br />
Here we have a explanations for $n$ instances and we construct an $n × d'$ matrix $W$, each row of which represent local importances of features used for each instance explanations. In case of using sparse linear models as explanations, for an instance $x_i$ and explanation $g_i = ξ(x_i)$, we set $W_{ij} = |w_{gij}|$. Further, for each feature (column) $j$ in $W$, let $I_j$ denote the global importance of that component in the explanation space. Intuitively, $I$ represents global importance of each feature. In Figure 5, we show a toy example $W$, with $n = d' = 5$, where $W$ is binary (for simplicity). Since feature $f2$ is used in most of the explanations, $I$ should have higher value for feature $f2$ than $f1$, i.e. $I_2 > I_1$. Formally for the text classifiers, we set $I_j = \sqrt{\sum_{i=1}^nW_{ij}}$ . For images, $I$ must represent something that can be compared across the super-pixels (segments) in different images, such as color histograms or other features of super-pixels; But this is left as a future work for now.<br />
<br />
While picking the samples, we must be careful in not picking up samples that explain the same thing. So we want a minimum set of samples that cover the maximum number of features as possible. In Figure 5, after the second row is picked, the third row is redundant, as the user has already seen features $f2$ and $f3$ - while the last row gives the user completely new features. Hence selecting the second and last row results in giving the maximum coverage of features. Eq. (3) formalizes this process, where we define coverage as the set function $c$ that, given $W$ and $I$, computes the total importance of the features that at least appear in one instance in a set $V$ .<br />
<br />
:<math>c(V, W, I) = {{\sum_{j = 1}^{d'}}}\mathbb{1}_{[\exists i\in V:W_{ij}\gt 0]}I_j\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, Eq. (3)</math><br />
<br />
The pick problem, defined in Eq. (4), consists of finding the set $V, |V| ≤ B$ that achieves the highest coverage.<br />
<br />
:<math>Pick(W, I) = \underset{V,|V|\leq B}{\operatorname{argmax}}\, c(V,W, I)\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, Eq. (4)</math><br />
<br />
The problem in Eq. (4) is maximizing a weighted coverage function, and is NP-hard [10]. Let $c(V\cup\{i\}, W, I)−c(V, W, I)$ be the marginal coverage gain of adding an instance $i$ to a set $V$ . Due to sub-modularity, a greedy algorithm that iteratively adds the instance with the highest marginal coverage gain to the solution offers a constant-factor approximation guarantee of $1−1/e$ to the optimum [8]. This approximation process is outlined in Algorithm 2 and is called sub-modular pick.<br />
<br />
[[File:algorithm2.png|thumb|550px]] <br />
[[File:toy_example.png|thumb|Figure 5: Toy example W. Rows represent instances (documents) and columns represent features (words). Feature f2 (dotted blue) has the highest importance. Rows 2 and 5 (in red) would be selected by the pick procedure, covering all but feature f1.]]<br />
<br />
==Simulated User Experiments==<br />
<br />
This section highlights simulated user experiments conducted to evaluate how explanations help the user in different tasks. In particular, following questions are addressed: (1) Are the explanations faithful to the model, that is are the explainers correctly highlighting the important features that the model is selecting at least locally? (2) Can the explanations help users in ascertaining trust in predictions; that is to say, if the predictions make sense or not? and (3) Are the explanations useful for evaluating the model as a whole and thereby aid in finally selecting a model? Code and data for experiments are available at https://github.com/marcotcr/lime-experiments.<br />
<br />
===Experiment Setup===<br />
<br />
2 sentiment analysis datasets (books and DVDs, 2000 instances each) where the task is to classify product reviews as positive or negative [9] are used in the experiment. Decision trees (DT), logistic regression with L2 regularization (LR), nearest neighbors (NN), and SVMs with RBF kernel, all using bag of words as features are trained separately. There is also a random forest (with 1000 trees) trained over average word2vec embeddings (RF). Unless otherwise noted, default methods of scikit are used. Dataset is divided into train and test in the ratio of 4:1. To explain individual predictions, LIME is compared with parzen [10], a method that approximates the black box classifier globally with Parzen windows, and explains individual predictions by taking the gradient of the prediction probability function. (Parzen-window density estimation is essentially a data-interpolation technique. Given an instance of the random sample, ${\bf x}$, Parzen-windowing estimates the PDF $P(X)$ from which the sample was derived. It essentially superposes kernel functions placed at each observation or datum. In this way, each observation $x_i$ contributes to the PDF estimate [15]). For parzen, $K$ features are selected as explanations such that they have the highest absolute gradients. Hyper-parameters for LIME and parzen are set using cross-validations and number of perturbed samples for a single explanation, $N$ is set to 15000. There is also a comparison made against a greedy procedure in which features that contribute the most to the predicted class are greedily removed until prediction changes or we reach a maximum of $K$ features that are removed, and these $K$ or fewer features are finally selected to explain the model. Also a random procedure that randomly picks K features as an explanation is also used for comparison. For all the experiments below $K$ is set to 10. For experiments where the pick procedure applies, it's either a random pick (RP) or sub-modular pick (SP). We refer to pick-explainer combinations by adding RP or SP as a prefix.<br />
<br />
[[File:recall1.png|thumb|Figure 6: Recall [11] on truly important features for two interpretable classifiers on the books dataset.]] [[File:recall2.png|thumb|Figure 7: Recall on truly important features for two interpretable classifiers on the DVDs dataset.]]<br />
<br />
===Are explanations faithful to the model?===<br />
<br />
Here we evaluate faithfulness of explanations on classifiers that are by themselves explicable such as sparse logistic regression or decision trees. So we train 2 models, a sparse logistic regressor and a decision tree such that the maximum number of features they use 10 so we now know what the important features (gold features) really are for these 2 models. For each prediction on the test set, we generate explanations and compute the fraction of these gold features that are recovered by the explanations (recall). We report this recall averaged over all the test instances in Figures 6 and 7. From the bar graphs we can see that the greedy approach is comparable to parzen on logistic regression, but is substantially worse on decision trees since changing a single feature at a time often does not have an effect on the prediction. The overall recall by parzen is low, likely due to the difficulty in approximating the original high-dimensional classifier. LIME consistently provides > 90% recall for both classifiers on both datasets, and this demonstrates that LIME explanations are more faithful to the model than any other explanations.<br />
<br />
===Should I trust this prediction?===<br />
<br />
In order to simulate trust in individual predictions, we first randomly select 25% of the features to be “untrustworthy”, and assume that the users can identify and would not want to trust these features (such as the headers in 20 newsgroups, leaked data, etc). We thus develop oracle “trustworthiness” by labeling test set predictions from a black box classifier as “untrustworthy” if the prediction changes when untrustworthy features are removed from the instance, and “trustworthy” otherwise. In order to simulate users, we assume that users deem predictions untrustworthy from LIME and parzen explanations if the prediction from the linear approximation changes when all untrustworthy features that appear in the explanations are removed (the simulated human “discounts” the effect of untrustworthy features). For greedy and random, the prediction is mistrusted if any untrustworthy features are present in the explanation since these methods do not provide a notion of the contribution of each feature to the prediction. Thus for each test set prediction, we can evaluate whether the simulated user trusts it using each explanation method, and compare it to the trustworthiness oracle. <br />
<br />
Using this setup, we report the score, F1 (average of precision and recall) on the trustworthy predictions for each explanation method, averaged over 100 runs, in Table 1. The results indicate that LIME dominates others (all results are significant at p = 0.01) on both datasets, and for all of the black box models. The other methods either achieve a lower recall (i.e. they mistrust predictions more than they should) or lower precision (i.e. they trust too many predictions), while LIME maintains both high precision and high recall. Even though we artificially select which features are untrustworthy, these results indicate that LIME is helpful in assessing trust in individual predictions.<br />
<br />
[[File:table1.png|thumb|Table 1: Average F1 of trustworthiness for different explainers on a collection of classifiers and datasets.]][[File:choose_classifier.png|thumb|Figure 8: Choosing between two classifiers, as the number of instances shown to a simulated user is varied. Averages and standard errors from 800 runs.]]<br />
<br />
===Can I trust this model?===<br />
<br />
In the final simulated user experiment, we consider a case where the user has to pick between 2 competing models with similar accuracy on validation dataset. For this purpose, we add 10 artificially “noisy” features. Specifically, on training and validation sets (80/20 split of the original training data), each artificial feature appears in 10% of the examples in one class, and 20% of the other, while on the test instances, each artificial feature appears in 10% of the examples in each class. This recreates the situation where the models use not only features that are informative in the real world, but also ones that introduce spurious correlations. We create pairs of competing classifiers by repeatedly training pairs of random forests with 30 trees until their validation accuracy is within 0.1% of each other, but their test accuracy differs by at least 5%. Thus, it is not possible to identify the better classifier (the one with higher test accuracy) from the accuracy on the validation data. <br />
<br />
The goal of this experiment is to evaluate whether a user can identify the better classifier based on the explanations of $B$ instances from the validation set. The simulated human marks the set of artificial features that appear in the $B$ explanations as untrustworthy, following which we evaluate how many total predictions in the validation set should be trusted (as in the previous section, treating only marked features as untrustworthy). Then, we select the classifier with fewer untrustworthy predictions and compare this choice to the classifier with higher held-out test set accuracy. <br />
<br />
We present the accuracy of picking the correct classifier as $B$ varies, averaged over 800 runs, in Figure 8. We omit SP-parzen and RP-parzen from the figure since they did not produce useful explanations, performing only slightly better than random. LIME is consistently better than greedy, irrespective of the pick method. Further, combining sub-modular pick with LIME outperforms all other methods, in particular, it is much better than RP-LIME when only a few examples are shown to the users. These results demonstrate that the trust assessments provided by SP-selected LIME explanations are good indicators of generalization."""<br />
<br />
==Conclusion and Future Work==<br />
<br />
The paper evaluates its approach on a series of simulated and human-in-the-loop tasks to check:<br />
* The predictions of any classifier in an interpretable and faithful manner.<br />
* The method to explain its models by obtaining individual predictions and their explanations.<br />
* Could the predictions be trusted.<br />
* Can the model be trusted.<br />
* Can users select the best classifier given the explanations.<br />
* Can user (non-experts) improve the classifier by means of feature selection.<br />
* Can explanations lead to insights about the model itself.<br />
<br />
This paper successfully argues for the importance of explaining the predictions to make the model more trustworthy to the user. It proposes LIME, a comprehensive approach to faithfully explain predictions of any model in the simplest way desired. Detailed experiments demonstrating how explanations helped users pick models or discard models are provided. As a future consideration, authors want to repeat the experiments with decision trees being the interpretable model instead of sparse linear models. Authors also want to come up with an approach to perform the pick step for images. They also want to look into an automated way of selecting hyper-parameters used in LIME and SP-LIME such as K, N etc. and finally consider running the code on GPU to create explanations in real-time.<br />
<br />
==Remarks and Critique==<br />
<br />
This paper is by far the best paper I have read on model interpretability. The paper is full of new ideas and the authors clearly explain the approaches used with reasons as to why they picked it and also highlight both weaknesses and strong points of the approaches. Even though some of the tricks used are a little ambiguous on paper such as K-Lasso, super-pixels, but going through the code on GitHub, it becomes more clear. The code is also very well documented and easy to install and run to reproduce the results published. I tried running it on CIFAR-10 dataset and found it to be useful in understanding my model better.<br />
<br />
There are few minor details that are unanswered in the paper.<br />
<br />
* To interpret prediction for an image classifier, authors use super-pixels (segments) as features, but they don't give any reason as to why they picked this approach. One of the reasons I can think of is that convolutional neural networks learn shapes from layer to layer, which becomes more and more complete as we move from top to lower layers. So in a way you can say, the classifier classifies on the basis of set of groups of pixels close to each other (as in the case of segments) than just the pixels that are spread apart far away from each other.<br />
<br />
* The authors don't say how locally faithful the interpretable model should be to the classifier with constraints on the maximum number of features to be used, i.e, what is the sort of error (mean squared error in case of linear interpretable models) we are content with?<br />
<br />
* It seems that LIME can easily extended to regression cases and is not just limited to classification tasks, but the paper doesn't discuss anything in this regard.<br />
<br />
This paper is also comparable to the DeepLIFT[17] method, which provides a global numerical recovery of sample manifold. Both methods rely heavily on the reference point chosen. LIME has the advantage of matching the behavior of the model with interpretable explanations, while DeepLIFT focuses on the model structure and yields overall assessment of the model by manually feeding several typical reference inputs into the method.<br />
<br />
In this paper, the experiments with Deep Convolutional Neural Network try to find the pixels which contribute most to the final output. As a complement reference, this paper[16] also focuses on the interpretability of CNN models by sharing the same idea. In that work, the authors use Global Averaging Pooling layer after the last convolutional layer. Then a heat map is generated by a weighted sum of the outputs from last convolutional layer. By plugging back to the input image, the area that contributes most to the output is highlighted. <br />
<br />
<gallery widths=300px heights=300px><br />
File:CAM.png|thumb|left|550px|Figure 9: example output from CAM<br />
File:cam2.jpg|thumb|right|695px|Figure 10: example output from CAM<br />
</gallery><br />
<br />
As a result, we can actually tell what is the model learning. Examples are shown in Figure 9, 10.<br />
<br />
==References==<br />
[1] Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin, Model-Agnostic Interpretability of Machine Learning, presented at 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), New York, NY<br />
<br />
[2] Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin, “Why Should I Trust You?” Explaining the Predictions of Any Classifier, KDD 2016 San Francisco, CA, USA<br />
<br />
[3] F. Wang and C. Rudin. Falling rule lists. In Artificial Intelligence and Statistics (AISTATS), 2015.<br />
<br />
[4] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.<br />
<br />
[5] http://scikit-image.org/docs/dev/api/skimage.segmentation.html#skimage.segmentation.quickshift<br />
<br />
[6] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.lars_path.html<br />
<br />
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.<br />
<br />
[8] A. Krause and D. Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, February 2014.<br />
<br />
[9] J. Blitzer, M. Dredze, and F. Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Association for Computational Linguistics (ACL), 2007.<br />
<br />
[10] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Muller. How to explain individual classification decisions. Journal of Machine Learning Research, 11, 2010.<br />
<br />
[11] https://en.wikipedia.org/wiki/Precision_and_recall<br />
<br />
[12] A. Bansal, A. Farhadi, and D. Parikh. Towards transparent systems: Semantic characterization of failure modes. In European Conference on Computer Vision (ECCV), 2014.<br />
<br />
[13] M. T. Dzindolet, S. A. Peterson, R. A. Pomranky, L. G. Pierce, and H. P. Beck. The role of trust in automation reliance. Int. J. Hum.-Comput. Stud., 58(6), 2003.<br />
<br />
[14] P. Zhang, J. Wang, A. Farhadi, M. Hebert, and D. Parikh. Predicting failures of vision systems. In Computer Vision and Pattern Recognition (CVPR), 2014.<br />
<br />
[15] Parzen Density Windows :- https://www.cs.utah.edu/~suyash/Dissertation_html/node11.html<br />
<br />
[16] Zhou, Bolei et al. “Learning Deep Features for Discriminative Localization.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016): 2921-2929.<br />
<br />
[17] Avanti Shrikumar, Peyton Greenside, Anshul Kundaje: Learning Important Features Through Propagating Activation Differences ([https://arxiv.org/abs/1704.02685 link])</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=%22Why_Should_I_Trust_You%3F%22:_Explaining_the_Predictions_of_Any_Classifier&diff=31249"Why Should I Trust You?": Explaining the Predictions of Any Classifier2017-11-23T16:45:18Z<p>Amanjhunjhunwala: /* Should I trust this prediction? */</p>
<hr />
<div>==Introduction==<br />
<br />
Understanding why machine learning models behave the way they do helps users in model selection, feature engineering, and finally trusting the model to deploy it. In many cases even though a non-interpretable model is more accurate on the validation data set than the interpretable one, an interpretable model is chosen. But restricting the models to just an interpretable model isn't the best option. In this paper, the authors argue for explaining machine learning predictions using model-agnostic approaches and propose LIME (Local Interpretable Model-Agnostic Explanations), a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. They also propose a method called sub-modular optimization to explain models globally by selecting a representative individual prediction. <br />
<br />
Some recent work aims to anticipate failures in machine learning, specifically for vision tasks. Failure modes are found by grouping incorrectly classified images according to a semantic attribute space, which are descriptions explaining the cause of the failure [12]. Another approach is to train a model that can predict failure [14]. Letting users know when the systems are likely to fail can lead to an increase in trust, by avoiding “silly mistakes” [13]. These solutions either require additional annotations and feature engineering that is specific to vision tasks or do not provide insight into why a decision should not be trusted. Furthermore, they assume that the current evaluation metrics are reliable, which may not be the case if problems such as data leakage are present. This is not the first work to look at the problems with relying on validation set accuracy as the primary measure of trust. This area has been well studied. Practitioners often overestimate their model's accuracy, propagate feedback loops, and fail to notice data leaks. Common solutions in the literature are Gestalt and Modeltracker. The LIME technique complements these tools. <br />
<br />
The experiments are conducted mainly on text data and image data, where models used are Random Forest (RF) and Convolutional Neural Networks (CNN). The experiments highlight utility of explanations in deciding between the models, which model is to be trusted and which is not and also improving the model based on the explanations. The main contributions of this paper are summarized as follows.<br />
<br />
* LIME, an algorithm to explain any prediction of any model (classifier or regressor) by approximating locally with an interpretable model that is faithful to the original model locally. <br />
<br />
* SP-LIME, a method to select a final set of representative samples using sub-modular optimization to show to the user that pretty much captures what the original model is doing globally. <br />
<br />
* A detailed set of experiments with simulated subjects to prove the usefulness of LIME and SP-LIME. <br />
<br />
[[File:LIME.jpg|thumb|550px|Figure 1: Explaining individual predictions. A model predicts that a patient has the flu, and LIME highlights the symptoms in the patient’s history that led to the prediction. Sneeze and headache are portrayed as contributing to the “flu” prediction, while “no fatigue” is evidence against it. With these, a doctor can make an informed decision about whether to trust the model’s prediction.]]<br />
<br />
== Need for Explanations ==<br />
<br />
Prediction explanation basically means answering question such as "What triggers the model to make such a prediction?", and an answer to that would be to say, these are the features with a certain range of values in your input sample that are contributing to the prediction the most. For example, it can be words in a document for text classification or patches of pixels in an image for image classification. And these explanations need to make sense to the user and hence has to be simple. Figure 1 illustrates an explanation procedure. In this case, an explanation is a small weighted list of symptoms that either contribute to the prediction (in green) or are evidence against it (in red). It is very clear that the life of a doctor is much easier in terms of making a decision with the help of a model if intelligible explanations are provided. This is similar to the topic of inference versus prediction. Some people tend to think of these two terms are the same. However, statistically, their meanings are very different. Inference means given some datasets, we would like to infer how the output/response is generated based on a sequence of explanatory variables. For instance, we would like to know/infer how education level affects people’s income level. Prediction, on the other hand, is by using a given dataset, we fit/train a model that will correctly predict the outcome of a new observation. <br />
<br />
Even with highest accuracy on validation dataset, we sometimes can't judge how the model is going to behave on other datasets. There are many reasons why this can happen. For example, Data leakage where some of the features used for training are heavily correlated with the target value in both training and validation data that might result in great train and validation results but will be of no use if used on altogether a new dataset. But if explanations such as the one in Figure 1 are given, it becomes easy to fix the fault by removing the heavily correlated feature from the data and train. This is how you convert an untrustworthy model to a trustworthy one. Another problem is dataset shift, where train and test data come from different distributions and providing explanations for predictions will help the user to take measures to make model generalize better.<br />
<br />
Figure 2 shows how explanations help in selecting between the models in addition to accuracy measures. In this case, by looking at the explanations it is easy to say that model with higher accuracy is in-fact the worst. Further, many a times, there can be difference between what we want the model to optimize and what the model is actually optimizing which might result into model not behaving as well as expected (For example in binary classification using logistic regression, all we are concerned is with accuracy, but the model is trying to reduce the log-loss due to the learning algorithms limitations). While we may not be able to quantitatively measure what difference has that made, but we still have some intuitions as to how the model has to behave with respect to some of its features, and if explanations are given for model predictions, we will be able to ascertain if those explanations make any sense. Also explaining models being model agnostic helps as we then compare different classes of models in making a choice.<br />
<br />
[[File:explanation_example.png|thumb|550px|Figure 2: Explaining individual predictions of competing classifiers trying to determine if a document is about “Christianity” or “Atheism”. The bar chart represents the importance given to the most relevant words, also highlighted in the text. Color indicates which class the word contributes to (green for “Christianity”, magenta for “Atheism”).]]<br />
<br />
===Desired Characteristics for Explainers===<br />
<br />
Explainers have to be simple models so that humans are able to comprehend what the model is doing. For example, even with linear models, if the number of features are too many, it becomes hard for humans to comprehend an overwhelming set of features. In case of models trained with word-embedding as features, we can't give these word-embeddings as explanations, instead, they need to something different than these features.<br />
<br />
Another requirement of a good explainer is that it has to be locally faithful to the original model, i.e, it must behave similarly to the model in the vicinity of the instance being predicted. And these local explanations should aid in forming global explanations.<br />
<br />
===Desired Characteristics for Explanations===<br />
<br />
'''Interpretable'''(understanding between the input and response)<br />
* Take into account user limitations.<br />
* Since features in a machine learning model need not be interpretable, the input to the explanations may have to be different from input to the model.<br />
<br />
'''Local Fidelity'''<br />
* Explanation should be locally faithful, ie it should correspond to how the model behaves in the vicinity of the instance being predicted. It doesn't mean global fidelity, where global important features may not be important in the local context.<br />
<br />
'''Model Agnostic'''<br />
* Treat the original, given model as a black box.<br />
<br />
'''Global Perspective'''(explain the model)<br />
* Select a few predictions such that they represent the entire model.<br />
<br />
==Local Interpretable Model-Agnostic Explanations (LIME)==<br />
<br />
The overall goal of LIME is to basically identify an interpretable model over the interpretable representation (features) that is locally faithful to the classifier. Here, interpretable explanations<br />
need to use a representation that is understandable to humans, regardless of the actual features used by the model. For example in case of text classification, a possible interpretable representation of data would be to use bag of words (one-hot encodings) instead of the word embeddings. Similarly, for image classification, it may be a binary vector (one-hot encoding) indicating the “presence” or “absence” of a contiguous patch of pixels (a super-pixel or a segment), while the classifier may represent the image as a tensor with three color channels per pixel. Let $x ∈ R^d$ be the original representation of an instance being explained, and $x' ∈ \{0, 1\}^{d'}$ be a binary vector for its interpretable representation.<br />
<br />
=== Fidelity-Interpretability Trade-Off ===<br />
Formally, the authors define an explanation as a model $g ∈ G$, where $G$ is a class of potentially interpretable models, such as linear models, decision trees, or falling rule lists [3]. The domain of $g$ is $\{0, 1\}^{d'}$, i.e. $g$ acts over absence/presence of the interpretable components. Let $Ω(g)$ be a measure of complexity (like a regularizer term) of the explanation $g ∈ G$. For example, for decision trees $Ω(g)$ may be the depth of the tree, for linear models, $Ω(g)$ may be the number of non-zero weights. Let the model being explained be denoted $f : R^d → R$. In classification, $f(x)$ is the probability that $x$ belongs to a certain class. We further use $π_x(z)$ as a proximity measure between an instance $z$ to $x$, so as to define locality around $x$. Finally, let $L(f, g, π_x)$ be a measure of error between $g$ and $f$ in the locality defined by $π_x$. Our task is to minimize $L(f, g, π_x)$ while having $Ω(g)$ be low enough to be interpretable by humans. The minimum value of $L(f, g, π_x)$ provides an approximate balance measure for preserving interpretability and local fidelity(it should correspond to how the model behaves in the vicinity of the instance being predicted). The explanation produced by LIME is obtained by the following:<br />
<br />
:<math>\xi(x) = \underset{g\in\mathbb{G}}{\operatorname{argmin}}\, \mathcal{L}(f,g,\pi_x) + \Omega(g)\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, Eq. (1)</math><br />
<br />
Even though $G$ can be any class of interpretable models, in this paper only sparse linear models are considered.<br />
<br />
=== Sampling for Local Exploration ===<br />
<br />
The aim is to minimize $L(f, g, π_x)$ without making any assumptions about $f$, since the explainer needs to be model-agnostic. In order to learn the local behavior of $f$ using $g$ as the interpretable inputs vary, $L(f, g, π_x)$ is approximated by drawing samples, weighted by $π_x$. We sample instances around $x'$ by setting few features of $x'$ to 0, uniformly at random (where the number of such features set to 0 is also uniformly sampled). Given a perturbed sample $z' ∈ \{0, 1\}^{d'}$ (which contains a fraction of the features of $x'$ set to 0), we recover the sample in the original representation $z ∈ R^d$ and obtain the prediction $f(z)$ from the original classifier, and is used as a label for the explanation model. Given this dataset $Z$ of perturbed samples with the associated labels, we optimize Eq. (1) to get an explanation $ξ(x)$. The primary intuition behind LIME is presented in Figure 3, where we sample instances both in the vicinity of $x$ (which have a high weight due to $π_x$) and far away from $x$ (low weight from $π_x$). Even though the original model may be too complex to explain globally, LIME presents an explanation that is locally faithful (linear in this case), where the locality is captured by a weight factor, $π_x$. A concrete example of this process is presented in the next section.<br />
<br />
[[File:decision_boundary.png|thumb|Figure 3: Toy example to present intuition for LIME. The black-box model’s complex decision function $f$ (unknown to LIME) is represented by the blue/pink background, which cannot be approximated well by a linear model. The bold red cross is the instance being explained. LIME samples instances, get predictions using $f$, and weighs them by the proximity to the instance being explained (represented here by size). The dashed line is the learned explanation that is locally (but not globally) faithful.]]<br />
<br />
===Sparse Linear Explanations===<br />
<br />
For the rest of this paper, we let $G$ be the class of linear models, such that $g(z') = w_g · z'$ . We use the locally weighted square loss as $L$, as defined in Eq. (2), where we let $π_x(z) = exp(−D(x, z)^2 /σ^2 )$ be an exponential kernel defined on some distance function $D$ (e.g. cosine distance for text, L2 distance for images) with width $σ$ resulting in higher sample weights for perturbed samples that are in the vicinity of sample being explained and lower weights for samples that are far away.<br />
<br />
:<math>\mathcal{L}(f,g,\pi_x) = \underset{z,z'\in\mathbb{Z}}{\operatorname{\Sigma}} \pi_x(z)\,(f(z) - g(z'))^2\,\,\,\,\,\,\,\,\,\,\,\, Eq.(2)</math><br />
<br />
For text classification, the features used for explanation models are bag of words, and we set a limit on the number of words used for explanation, $K$. i.e. $Ω(g) = ∞1[||w_g||_0 > K]$. We use the same $Ω$ for image classification, using “super-pixels” (computed using any standard image segmentation algorithm, eg. quickshift() function in skimage python library [5]) instead of words, such that the interpretable representation of an image is a binary vector where 1 indicates the original super-pixel and 0 indicates a grayed out super-pixel. This particular choice of $Ω$ makes directly solving Eq. (1) intractable, but we approximate it by first selecting $K$ features with Lasso (using the regularization path [4]) (scikit lars_path method [6]) and then learning the weights via least squares (together a procedure we call K-LASSO in Algorithm 1). . <br />
<br />
There are some limitations to class of locally interpretable models $G$. It can happen that no $g \in G$ is powerful enough to be locally faithful to the original model, i.e, $g$ has a high bias w.r.t $f$ locally. This may be because the underlying model is quite complex even locally. Another issue could be, the features used for building explanations are not representative of the inputs or factors that the underlying model relies on. For example, a model that predicts sepia-toned images to be retro cannot be explained by the presence or absence of super pixels.<br />
<br />
<br />
[[File:algorithm1.png|thumb|550px]]<br />
<br />
==Examples==<br />
<br />
===Example 1: Text classification with SVMs===<br />
<br />
In Figure 2 (right side), LIME explains an SVM classifier with RBF kernel trained on "20 newsgroup dataset" to classify documents into "Atheism" and "Christianity". Even though it achieves 94% accuracy on validation dataset, the explanation for an instance shows that features that are used the most in predictions are very arbitrary such as words like "Posting", "Host" and "Re", which have no connection to either Christianity or Atheism. Even if these stop words or headers are removed, the classifier still considers the proper names of people writing the post to be important features, which clearly doesn't make sense. Hence we can conclude by looking at the explanations that the dataset has serious issues and when a classifier is trained on it, it won't be able to generalize well. Hence suitable steps needed to be taken to train a trustworthy classifier.<br />
<br />
===Example 2: Deep networks for images===<br />
<br />
In this example, we use sparse linear explanations to explain Google’s pre-trained Inception neural network [7] which is an image classifier. Here features used for giving explanations are super-pixels (segments) which intuitively makes sense as we humans detect an object in an image based on certain parts of the object. Figure 4a show an arbitrary image that we want to explain. Figures 4b, 4c, 4d show the super-pixels that contribute the most while predicting the image to be one of the top 3 classes. Here we set max feature count for explainer, $K = 10$. It is quite intuitive from Figure 4b in particular that why acoustic guitar was predicted to be an electric based on the fretboard that resembles an acoustic guitar. So even though the prediction is incorrect, it still is not unreasonable at all.<br />
<br />
[[File:inception_example.png|thumb|center|550px]]<br />
<br />
==Submodular Pick for Explaining Models==<br />
<br />
Submodular pick provides some sort of a global understanding of the model by methodically picking a set of non-redundant sample explanations. This method is still model agnostic and complimentary to calculating validation accuracy in machine learning problem. If we choose a large number of instances to explain, the user may not be able to go through each of them and verify what is wrong or right with the model. We say humans have a budget $B$, that is the number of instance explanations that they are willing to examine. So given a set of $X$ instances, the task of sub-modular optimization is to pick a maximum of $B$ instances that effectively capture the essence of what the model is doing in a non-redundant way.<br />
<br />
Here we have a explanations for $n$ instances and we construct an $n × d'$ matrix $W$, each row of which represent local importances of features used for each instance explanations. In case of using sparse linear models as explanations, for an instance $x_i$ and explanation $g_i = ξ(x_i)$, we set $W_{ij} = |w_{gij}|$. Further, for each feature (column) $j$ in $W$, let $I_j$ denote the global importance of that component in the explanation space. Intuitively, $I$ represents global importance of each feature. In Figure 5, we show a toy example $W$, with $n = d' = 5$, where $W$ is binary (for simplicity). Since feature $f2$ is used in most of the explanations, $I$ should have higher value for feature $f2$ than $f1$, i.e. $I_2 > I_1$. Formally for the text classifiers, we set $I_j = \sqrt{\sum_{i=1}^nW_{ij}}$ . For images, $I$ must represent something that can be compared across the super-pixels (segments) in different images, such as color histograms or other features of super-pixels; But this is left as a future work for now.<br />
<br />
While picking the samples, we must be careful in not picking up samples that explain the same thing. So we want a minimum set of samples that cover the maximum number of features as possible. In Figure 5, after the second row is picked, the third row is redundant, as the user has already seen features $f2$ and $f3$ - while the last row gives the user completely new features. Hence selecting the second and last row results in giving the maximum coverage of features. Eq. (3) formalizes this process, where we define coverage as the set function $c$ that, given $W$ and $I$, computes the total importance of the features that at least appear in one instance in a set $V$ .<br />
<br />
:<math>c(V, W, I) = {{\sum_{j = 1}^{d'}}}\mathbb{1}_{[\exists i\in V:W_{ij}\gt 0]}I_j\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, Eq. (3)</math><br />
<br />
The pick problem, defined in Eq. (4), consists of finding the set $V, |V| ≤ B$ that achieves the highest coverage.<br />
<br />
:<math>Pick(W, I) = \underset{V,|V|\leq B}{\operatorname{argmax}}\, c(V,W, I)\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, Eq. (4)</math><br />
<br />
The problem in Eq. (4) is maximizing a weighted coverage function, and is NP-hard [10]. Let $c(V\cup\{i\}, W, I)−c(V, W, I)$ be the marginal coverage gain of adding an instance $i$ to a set $V$ . Due to sub-modularity, a greedy algorithm that iteratively adds the instance with the highest marginal coverage gain to the solution offers a constant-factor approximation guarantee of $1−1/e$ to the optimum [8]. This approximation process is outlined in Algorithm 2 and is called sub-modular pick.<br />
<br />
[[File:algorithm2.png|thumb|550px]] <br />
[[File:toy_example.png|thumb|Figure 5: Toy example W. Rows represent instances (documents) and columns represent features (words). Feature f2 (dotted blue) has the highest importance. Rows 2 and 5 (in red) would be selected by the pick procedure, covering all but feature f1.]]<br />
<br />
==Simulated User Experiments==<br />
<br />
This section highlights simulated user experiments conducted to evaluate how explanations help the user in different tasks. In particular, following questions are addressed: (1) Are the explanations faithful to the model, that is are the explainers correctly highlighting the important features that the model is selecting at least locally? (2) Can the explanations help users in ascertaining trust in predictions; that is to say, if the predictions make sense or not? and (3) Are the explanations useful for evaluating the model as a whole and thereby aid in finally selecting a model? Code and data for experiments are available at https://github.com/marcotcr/lime-experiments.<br />
<br />
===Experiment Setup===<br />
<br />
2 sentiment analysis datasets (books and DVDs, 2000 instances each) where the task is to classify product reviews as positive or negative [9] are used in the experiment. Decision trees (DT), logistic regression with L2 regularization (LR), nearest neighbors (NN), and SVMs with RBF kernel, all using bag of words as features are trained separately. There is also a random forest (with 1000 trees) trained over average word2vec embeddings (RF). Unless otherwise noted, default methods of scikit are used. Dataset is divided into train and test in the ratio of 4:1. To explain individual predictions, LIME is compared with parzen [10], a method that approximates the black box classifier globally with Parzen windows, and explains individual predictions by taking the gradient of the prediction probability function. (Parzen-window density estimation is essentially a data-interpolation technique. Given an instance of the random sample, ${\bf x}$, Parzen-windowing estimates the PDF $P(X)$ from which the sample was derived. It essentially superposes kernel functions placed at each observation or datum. In this way, each observation $x_i$ contributes to the PDF estimate [15]). For parzen, $K$ features are selected as explanations such that they have the highest absolute gradients. Hyper-parameters for LIME and parzen are set using cross-validations and number of perturbed samples for a single explanation, $N$ is set to 15000. There is also a comparison made against a greedy procedure in which features that contribute the most to the predicted class are greedily removed until prediction changes or we reach a maximum of $K$ features that are removed, and these $K$ or fewer features are finally selected to explain the model. Also a random procedure that randomly picks K features as an explanation is also used for comparison. For all the experiments below $K$ is set to 10. For experiments where the pick procedure applies, it's either a random pick (RP) or sub-modular pick (SP). We refer to pick-explainer combinations by adding RP or SP as a prefix.<br />
<br />
[[File:recall1.png|thumb|Figure 6: Recall [11] on truly important features for two interpretable classifiers on the books dataset.]] [[File:recall2.png|thumb|Figure 7: Recall on truly important features for two interpretable classifiers on the DVDs dataset.]]<br />
<br />
===Are explanations faithful to the model?===<br />
<br />
Here we evaluate faithfulness of explanations on classifiers that are by themselves explicable such as sparse logistic regression or decision trees. So we train 2 models, a sparse logistic regressor and a decision tree such that the maximum number of features they use 10 so we now know what the important features (gold features) really are for these 2 models. For each prediction on the test set, we generate explanations and compute the fraction of these gold features that are recovered by the explanations (recall). We report this recall averaged over all the test instances in Figures 6 and 7. From the bar graphs we can see that the greedy approach is comparable to parzen on logistic regression, but is substantially worse on decision trees since changing a single feature at a time often does not have an effect on the prediction. The overall recall by parzen is low, likely due to the difficulty in approximating the original high-dimensional classifier. LIME consistently provides > 90% recall for both classifiers on both datasets, and this demonstrates that LIME explanations are more faithful to the model than any other explanations.<br />
<br />
===Should I trust this prediction?===<br />
<br />
In order to simulate trust in individual predictions, we first randomly select 25% of the features to be “untrustworthy”, and assume that the users can identify and would not want to trust these features (such as the headers in 20 newsgroups, leaked data, etc). We thus develop oracle “trustworthiness” by labeling test set predictions from a black box classifier as “untrustworthy” if the prediction changes when untrustworthy features are removed from the instance, and “trustworthy” otherwise. In order to simulate users, we assume that users deem predictions untrustworthy from LIME and parzen explanations if the prediction from the linear approximation changes when all untrustworthy features that appear in the explanations are removed (the simulated human “discounts” the effect of untrustworthy features). For greedy and random, the prediction is mistrusted if any untrustworthy features are present in the explanation since these methods do not provide a notion of the contribution of each feature to the prediction. Thus for each test set prediction, we can evaluate whether the simulated user trusts it using each explanation method, and compare it to the trustworthiness oracle. <br />
<br />
Using this setup, we report the score, F1 (average of precision and recall) on the trustworthy predictions for each explanation method, averaged over 100 runs, in Table 1. The results indicate that LIME dominates others (all results are significant at p = 0.01) on both datasets, and for all of the black box models. The other methods either achieve a lower recall (i.e. they mistrust predictions more than they should) or lower precision (i.e. they trust too many predictions), while LIME maintains both high precision and high recall. Even though we artificially select which features are untrustworthy, these results indicate that LIME is helpful in assessing trust in individual predictions.<br />
<br />
[[File:table1.png|thumb|Table 1: Average F1 of trustworthiness for different explainers on a collection of classifiers and datasets.]][[File:choose_classifier.png|thumb|Figure 8: Choosing between two classifiers, as the number of instances shown to a simulated user is varied. Averages and standard errors from 800 runs.]]<br />
<br />
===Can I trust this model?===<br />
<br />
In the final simulated user experiment, we consider a case where the user has to pick between 2 competing models with similar accuracy on validation dataset. For this purpose, we add 10 artificially “noisy” features. Specifically, on training and validation sets (80/20 split of the original training data), each artificial feature appears in 10% of the examples in one class, and 20% of the other, while on the test instances, each artificial feature appears in 10% of the examples in each class. This recreates the situation where the models use not only features that are informative in the real world, but also ones that introduce spurious correlations. We create pairs of competing classifiers by repeatedly training pairs of random forests with 30 trees until their validation accuracy is within 0.1% of each other, but their test accuracy differs by at least 5%. Thus, it is not possible to identify the better classifier (the one with higher test accuracy) from the accuracy on the validation data. <br />
<br />
The goal of this experiment is to evaluate whether a user can identify the better classifier based on the explanations of $B$ instances from the validation set. The simulated human marks the set of artificial features that appear in the $B$ explanations as untrustworthy, following which we evaluate how many total predictions in the validation set should be trusted (as in the previous section, treating only marked features as untrustworthy). Then, we select the classifier with fewer untrustworthy predictions and compare this choice to the classifier with higher held-out test set accuracy. <br />
<br />
We present the accuracy of picking the correct classifier as $B$ varies, averaged over 800 runs, in Figure 8. We omit SP-parzen and RP-parzen from the figure since they did not produce useful explanations, performing only slightly better than random. LIME is consistently better than greedy, irrespective of the pick method. Further, combining sub-modular pick with LIME outperforms all other methods, in particular, it is much better than RP-LIME when only a few examples are shown to the users. These results demonstrate that the trust assessments provided by SP-selected LIME explanations are good indicators of generalization."""<br />
<br />
==Conclusion and Future Work==<br />
<br />
Conclusion<br />
<br />
The paper evaluates its approach on a series of simulated and human-in-the-loop tasks to check:<br />
* The predictions of any classifier in an interpretable and faithful manner.<br />
* The method to explain its models by obtaining individual predictions and their explanations.<br />
* Could the predictions be trusted.<br />
* Can the model be trusted.<br />
* Can users select the best classifier given the explanations.<br />
* Can user (non-experts) improve the classifier by means of feature selection.<br />
* Can explanations lead to insights about the model itself.<br />
<br />
This paper successfully argues for the importance of explaining the predictions to make the model more trustworthy to the user. It proposes LIME, a comprehensive approach to faithfully explain predictions of any model in the simplest way desired. Detailed experiments demonstrating how explanations helped users pick models or discard models are provided. As a future consideration, authors want to repeat the experiments with decision trees being the interpretable model instead of sparse linear models. Authors also want to come up with an approach to perform the pick step for images. They also want to look into an automated way of selecting hyper-parameters used in LIME and SP-LIME such as K, N etc. and finally consider running the code on GPU to create explanations in real-time.<br />
<br />
==Remarks and Critique==<br />
<br />
This paper is by far the best paper I have read on model interpretability. The paper is full of new ideas and the authors clearly explain the approaches used with reasons as to why they picked it and also highlight both weaknesses and strong points of the approaches. Even though some of the tricks used are a little ambiguous on paper such as K-Lasso, super-pixels, but going through the code on GitHub, it becomes more clear. The code is also very well documented and easy to install and run to reproduce the results published. I tried running it on CIFAR-10 dataset and found it to be useful in understanding my model better.<br />
<br />
There are few minor details that are unanswered in the paper.<br />
<br />
* To interpret prediction for an image classifier, authors use super-pixels (segments) as features, but they don't give any reason as to why they picked this approach. One of the reasons I can think of is that convolutional neural networks learn shapes from layer to layer, which becomes more and more complete as we move from top to lower layers. So in a way you can say, the classifier classifies on the basis of set of groups of pixels close to each other (as in the case of segments) than just the pixels that are spread apart far away from each other.<br />
<br />
* The authors don't say how locally faithful the interpretable model should be to the classifier with constraints on the maximum number of features to be used, i.e, what is the sort of error (mean squared error in case of linear interpretable models) we are content with?<br />
<br />
* It seems that LIME can easily extended to regression cases and is not just limited to classification tasks, but the paper doesn't discuss anything in this regard.<br />
<br />
This paper is also comparable to the DeepLIFT[17] method, which provides a global numerical recovery of sample manifold. Both methods rely heavily on the reference point chosen. LIME has the advantage of matching the behavior of the model with interpretable explanations, while DeepLIFT focuses on the model structure and yields overall assessment of the model by manually feeding several typical reference inputs into the method.<br />
<br />
In this paper, the experiments with Deep Convolutional Neural Network try to find the pixels which contribute most to the final output. As a complement reference, this paper[16] also focuses on the interpretability of CNN models by sharing the same idea. In that work, the authors use Global Averaging Pooling layer after the last convolutional layer. Then a heat map is generated by a weighted sum of the outputs from last convolutional layer. By plugging back to the input image, the area that contributes most to the output is highlighted. <br />
<br />
<gallery widths=300px heights=300px><br />
File:CAM.png|thumb|left|550px|Figure 9: example output from CAM<br />
File:cam2.jpg|thumb|right|695px|Figure 10: example output from CAM<br />
</gallery><br />
<br />
As a result, we can actually tell what is the model learning. Examples are shown in Figure 9, 10.<br />
<br />
==References==<br />
[1] Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin, Model-Agnostic Interpretability of Machine Learning, presented at 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), New York, NY<br />
<br />
[2] Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin, “Why Should I Trust You?” Explaining the Predictions of Any Classifier, KDD 2016 San Francisco, CA, USA<br />
<br />
[3] F. Wang and C. Rudin. Falling rule lists. In Artificial Intelligence and Statistics (AISTATS), 2015.<br />
<br />
[4] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.<br />
<br />
[5] http://scikit-image.org/docs/dev/api/skimage.segmentation.html#skimage.segmentation.quickshift<br />
<br />
[6] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.lars_path.html<br />
<br />
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.<br />
<br />
[8] A. Krause and D. Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, February 2014.<br />
<br />
[9] J. Blitzer, M. Dredze, and F. Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Association for Computational Linguistics (ACL), 2007.<br />
<br />
[10] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Muller. How to explain individual classification decisions. Journal of Machine Learning Research, 11, 2010.<br />
<br />
[11] https://en.wikipedia.org/wiki/Precision_and_recall<br />
<br />
[12] A. Bansal, A. Farhadi, and D. Parikh. Towards transparent systems: Semantic characterization of failure modes. In European Conference on Computer Vision (ECCV), 2014.<br />
<br />
[13] M. T. Dzindolet, S. A. Peterson, R. A. Pomranky, L. G. Pierce, and H. P. Beck. The role of trust in automation reliance. Int. J. Hum.-Comput. Stud., 58(6), 2003.<br />
<br />
[14] P. Zhang, J. Wang, A. Farhadi, M. Hebert, and D. Parikh. Predicting failures of vision systems. In Computer Vision and Pattern Recognition (CVPR), 2014.<br />
<br />
[15] Parzen Density Windows :- https://www.cs.utah.edu/~suyash/Dissertation_html/node11.html<br />
<br />
[16] Zhou, Bolei et al. “Learning Deep Features for Discriminative Localization.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016): 2921-2929.<br />
<br />
[17] Avanti Shrikumar, Peyton Greenside, Anshul Kundaje: Learning Important Features Through Propagating Activation Differences ([https://arxiv.org/abs/1704.02685 link])</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Teaching_Machines_to_Describe_Images_via_Natural_Language_Feedback&diff=31248STAT946F17/ Teaching Machines to Describe Images via Natural Language Feedback2017-11-23T16:43:53Z<p>Amanjhunjhunwala: /* Comments */</p>
<hr />
<div>= Introduction = <br />
In the era of Artificial Intelligence, one should ideally be able to educate the robot about its mistakes, possibly without needing to dig into the underlying software. Reinforcement learning (RL) has become a standard way of training artificial agents that interact with an environment. RL agents optimize their action policies so as to maximize the expected reward received from the environment. Several works explored the idea of incorporating humans into the learning process, in order to help the reinforcement learning agent to learn faster. In most cases, the guidance comes in the form of a simple numerical (or “good”/“bad”) reward. In this work, natural language is used as a way to guide an RL agent. The author argues that a sentence provides a much stronger learning signal than a numeric reward in that we can easily point to where the mistakes occur and suggest how to correct them. <br />
<br />
[[File:teachingF.png|700px]]<br />
<br />
Here the goal is to allow a non-expert human teacher to give feedback to an RL agent in the form of natural language, just as one would to a learning child. The author has focused on the problem of image captioning, a task where the content of an image is described using sentences. This can also be seen as a multimodal problem where the whole network/model needs to combine the solution space of learning in both the image processing and text-generation domain. Image captioning is an application where the quality of the output can easily be judged by non-experts.<br />
<br />
= Related Works =<br />
Several works incorporate human feedback to help an RL agent learn faster.<br />
#Thomaz et al. (2006) exploits humans in the loop to teach an agent to cook in a virtual kitchen. The users watch the agent learn and may intervene at any time to give a scalar reward. Reward shaping (Ng et al., 1999) is used to incorporate this information in the Markov Decision Process (MDP).<br />
#Judah et al. (2010) iterates between “practice”, during which the agent interacts with the real environment, and a critique session where a human labels any subset of the chosen actions as good or bad.<br />
#Knox et al., (2012) discusses different ways of incorporating human feedback, including reward shaping, Q augmentation, action biasing, and control sharing.<br />
#Griffith et al. (2013) proposes policy shaping which incorporates right/wrong feedback by utilizing it as direct policy labels. <br />
#Mao et. al. propose a multimodal Recurrent Neural Network (m-RNN) for image captioning on 4 crucial datasets: IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO [14].Their approach involves a double network comprising of a deep RNN for sentence generation and a deep CNN for image learning.<br />
<br />
Above approaches mostly assume that humans provide a numeric reward, unlike in this work where feedback is given in natural language. A few attempts have been made to advise an RL agent using language.<br />
# Maclin et al. (1994) translated advice to a short program which was then implemented as a neural network. The units in this network represent Boolean concepts, which recognize whether the observed state satisfies the constraints given by the program. In such a case, the advice network will encourage the policy to take the suggested action.<br />
# Weston et al. (2016) incorporates human feedback to improve a text-based question answering agent.<br />
# Kaplan et al. (2017) exploits textual advice to improve training time of the A3C algorithm in playing an Atari game.<br />
<br />
The authors propose the Phrase-based Image Captioning Model which is similar to most image captioning models except that it exploits attention and linguistic information. Several recent approaches trained the captioning model with policy gradients in order to directly optimize for the<br />
desired performance metrics. This work follows the same line. <br />
<br />
There is also similar efforts on dialogue based visual representation learning and conversation modeling. These models aim to mimic human-to-human conversations while in this work the human converses with and guides an artificial learning agent.<br />
<br />
= Methodology =<br />
The framework consists of a new phrase-based captioning model trained with Policy Gradients that incorporates natural language feedback provided by a human teacher. The phrase-based captioning model allows natural guidance by a non-expert.<br />
<br />
=== Phrase-based Image Captioning ===<br />
The captioning model uses a hierarchical recurrent neural network (RNN). Hierarchical RNN is a complex architecture that typically uses stacked RNNs because of their ability to build better hidden states near the output level. In this paper, author have utilized Hierarchical RNN composed of a two-level LSTMs, a '''phrase RNN''' at the top level, and a '''word RNN''' at second level that generates a sequence of words for each phrase. The model receives an image as input and outputs a caption. One can think of the phrase RNN as providing a “topic” at each time step, which instructs the word RNN what to talk about. The structure of the model is explained in Figure 1.<br />
<br />
[[File:modelham.png|center|500px|thumb|Figure 1: Hierarchical phrase-based captioning model, composed of a phrase-RNN at the top level, and a word level RNN which outputs a sequence of words for each phrase.]]<br />
<br />
A convolutional neural network is used in order to extract a set of feature vectors $a = (a_1, \dots, a_n)$, with $a_j$ a feature in location j in the input image. These feature vectors are given to the attention layer. There are also two more inputs to the attention layer, current hidden state of the phrase-RNN and output of the label unit. The label unit predicts one out of four possible phrase labels, i.e., a noun (NP), preposition (PP), verb (VP), and conjunction phrase (CP). This information could be useful for the attention layer. For example, when we have a NP the model may look at objects in the image, while for VP it may focus on more global information. Computations can be expressed with the following equations:<br />
<br />
$$<br />
\begin{align*}<br />
\small{\textrm{hidden state of the phrase-RNN at time step t}} \leftarrow h_t &= f_{phrase}(h_{t-1}, l_{t-1}, c_{t-1}, e_{t-1}) \\<br />
\small{\text{output of the label unit}} \leftarrow l_t &= softmax(f_{phrase-label}(h_t)) \\<br />
\small{\text{output of the attention layer}} \leftarrow c_t &= f_{att}(h_t, l_t, a)<br />
\end{align*}<br />
$$<br />
<br />
After deciding about phrases, the outputs of phrase-RNN go to another LSTM to produce words for each phrase. $w_{t,i}$ denotes the i-th word output of the word-RNN in the t-th phrase. There is an additional <EOP> token in word-RNN’s vocabulary, which signals the end-of-phrase. Furthermore, $h_{t,i}$ denotes the i-th hidden state of the word-RNN for the t-th phrase. <br />
$$<br />
h_{t,i} = f_{word}(h_{t,i-1}, c_t, w_{t,i}) \\<br />
w_{t,i} = f_{out}(h_{t,i}, c_t, w_{t,i-1}) \\ <br />
e_t = f_{word-phrase}(w_{t,1}, \dots ,w_{t,n})<br />
$$<br />
<br />
Note that $e_t$ encodes the generated phrase via simple mean-pooling over the words, which provides additional word-level context to the next phrase.<br />
<br />
=== Crowd-sourcing Human Feedback ===<br />
The authors have created a web interface that allows collecting feedback information on a larger scale via AMT. Figure 2 depicts the interface and an example of caption correction. A snapshot of the model is used to generate captions for a subset of MS-COCO images using greedy decoding. There are two rounds of annotation. In the first round, the annotator is shown a captioned image and is asked to assess the quality of the caption, by choosing between: perfect, acceptable, grammar mistakes only, minor or major errors. They ask the annotators to choose minor and major error if the caption contained errors in semantics. They advise them to choose minor for small errors such as wrong or missing attributes or awkward prepositions and go with major errors whenever any object or action naming is wrong. A visualization of this web-based interface is provided in Figure 3(a).<br />
<br />
[[File:crowd.png|600px|center|thumb|Figure 2: An example of a generated caption and its corresponding feedback]]<br />
[[File:teaching 1.PNG|600px|center|thumb|Figure 3(a): Web-based feedback collection interface]]<br />
For the next (more detailed, and thus more costly) round of annotation, They only select captions which are not marked as either perfect or acceptable in the first round. Since these captions contain errors, the new annotator is required to provide detailed feedback about the mistakes. Annotators are asked to:<br />
#Choose the type of required correction (something “ should be replaced”, “is missing”, or “should be deleted”) <br />
#Write feedback in natural language (annotators are asked to describe a single mistake at a time)<br />
#Mark the type of mistake (whether the mistake corresponds to an error in object, action, attribute, preposition, counting, or grammar)<br />
#Highlight the word/phrase that contains the mistake <br />
#Correct the chosen word/phrase<br />
#Evaluate the quality of the caption after correction (it could be bad even after one round of correction)<br />
<br />
Figure 3(b) shows the statics of the evaluations before and after one round of correction task. The authors acknowledge the costliness of the second round of annotation.<br />
<br />
[[File:ham1.png|660px|center|thumb|Figure 3(b): Caption quality evaluation by the human annotators. The plot on the left shows evaluation for captions generated with the reference model (MLE). The right plot shows evaluation of the human-corrected captions (after completing at least one round of feedback).]]<br />
<br />
=== Feedback Network ===<br />
<br />
The collected feedback provides a strong supervisory signal which can be used in the RL framework. In particular, the authors design a neural network (feedback network or FBN) which will provide an additional reward based on the feedback sentence.<br />
<br />
RL training will require us to generate samples (captions) from the model. Thus, during training, the sampled captions for each training image will differ from the reference maximum likelihood estimation (MLE) caption for which the feedback is provided. The goal of the feedback network is to read a newly sampled caption and judge the correctness of each phrase conditioned on the feedback. This network performs the following computations:<br />
<br />
[[File:fbn.JPG|550px|right|thumb|Figure 4: The architecture of the feedback network (FBN) that classifies each phrase in a sampled sentence (top left) as either correct, wrong or not relevant, by conditioning on the feedback sentence.]]<br />
<br />
<br />
$$<br />
h_t^{caption} = f_{sent}(h_{t-1}^{caption}, \omega_t^c) \\<br />
h_t^{feedback} = f_{sent}(h_{t-1}^{feedback}, \omega_t^f) \\<br />
q_i = f_{phrase}(\omega_{i,1}^c, \omega_{i,2}^c, \dots, \omega_{i,N}^c) \\ <br />
o_i = f_{fbn}(h_T^{caption}, h_T^{feedback }, q_i, m) \\<br />
$$<br />
<br />
<br />
Here, $\omega_t^c$ and $\omega_t^f$ denote the one-hot encoding of words in the sampled caption and feedback sentence for the t-th phrase, respectively. FBN encodes both the caption and feedback using an LSTM ($f_{sent}$), performs mean pooling ($f_{phrase}$) over the words in a phrase to represent the phrase i with $q_i$, and passes this information through a 3-layer MLP ($f_{fbn}$). The MLP accepts additional information about the mistake type (e.g., wrong object/action) encoded as a one hot vector m. The output layer of the MLP is a 3-way classification layer that predicts whether the phrase i is correct, wrong, or not relevant (wrt feedback sentence).<br />
<br />
=== Policy Gradient Optimization using Natural Language Feedback ===<br />
<br />
One can think of a caption decoder as an agent following a parameterized policy $p_\theta$ that selects an action at each time step. An “action” in our case requires choosing a word from the vocabulary (for the word RNN), or a phrase label (for the phrase RNN). The objective for learning the parameters of the model is the expected reward received when completing the caption $w^s = (w^s_1, \dots ,w^s_T)$. Here, $w_t^s$ is the word sampled from the model at time step t.<br />
<br />
$$<br />
L(\theta) = -\mathop{{}\mathbb{E}}_{\omega^s \sim p_\theta}[r(w^s)] <br />
$$<br />
Such an objective function is non-differentiable. Thus policy gradients are used as in [13] to find the gradient of the objective function:<br />
$$<br />
\nabla_\theta L(\theta) = - \mathop{{}\mathbb{E}}_{\omega^s \sim p_\theta}[r(w^s)\nabla_\theta \log p_\theta(w^s)]<br />
$$<br />
Which is estimated using a single Monte-Carlo sample:<br />
$$<br />
\nabla_\theta L(\theta) \approx - r(w^s)\nabla_\theta \log p_\theta(w^s)<br />
$$<br />
Then a baseline $b = r(\hat \omega)$ is used. A baseline does not change the expected gradient but can drastically reduce the variance.<br />
$$<br />
\hat{\omega}_t = argmax \ p(\omega_t|h_t) \\<br />
\nabla_\theta L(\theta) \approx - (r(\omega^s) - r(\hat{\omega}))\nabla_\theta \log p_\theta(\omega^s)<br />
$$<br />
'''Reward:''' A sentence reward is defined as a weighted sum of the BLEU scores. BLEU is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. Additionally, it was one of the first metrics to claim a high correlation with human judgements of quality [10, 11 and 12] and remains one of the most popular automated and inexpensive metrics (more information about BLUE score [http://www1.cs.columbia.edu/nlp/sgd/bleu.pdf here] and a nice discussion on it [https://www.youtube.com/watch?v=ORHVgR-DVGg here]).<br />
<br />
$$<br />
r(\omega^s) = \beta \sum_i \lambda_i \cdot BLEU_i(\omega^s, ref)<br />
$$<br />
<br />
As reference captions to compute the reward, the authors either use the reference captions generated by a snapshot of the model which were evaluated as not having minor and major errors, or ground-truth captions. In addition, they weigh the reward by the caption quality as provided by the annotators (e.g. $\beta = 1$ for perfect and $\beta = 0.8$ for acceptable). They further incorporate the reward provided by the feedback network:<br />
$$<br />
r(\omega_t^p) = r(\omega^s) + \lambda_f f_{fbn}(\omega^s, feedback, \omega_t^p)<br />
$$<br />
Where $\omega^p_t$ denotes the sequence of words in the t-th phrase. $\omega^s$ denotes generated sentence, and $\omega^s = \omega^p_1 \omega^p_2 \dots \omega^p_P$.<br />
<br />
<br />
Note that FBN produces a classification of each phrase. This can be converted into reward, by assigning<br />
correct to 1, wrong to -1, and 0 to not relevant. So the final gradient takes the following form:<br />
$$<br />
\nabla_\theta L(\theta) = - \sum_{p=1}^{P}(r(\omega^p) - r(\hat{\omega}^p))\nabla_\theta \log p_\theta(\omega^p)<br />
$$<br />
<br />
=== Implementation ===<br />
The authors use Adam optimizer with learning rate $1e-6$ and batch size $50$. They first optimize the cross entropy loss for the first $K$ epochs, then for the following $t = 1$ to $T$ epochs, they use cross entropy loss for the first $P − \lfloor\frac{t}{m}\rfloor$ phrases (where $P$ denotes the number of phrases), and the policy gradient algorithm for the remaining $\lfloor \frac{t}{m} \rfloor$ phrases ($m = 5$).<br />
<br />
= Experimental Results =<br />
The authors used MS-COCO dataset. COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features: Object segmentation, Recognition in context, Superpixel stuff segmentation, 330K images (>200K labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image, 250,000 people with key points. They used 82K images for training, 2K for validation, and 4K for testing. To collect feedback, they randomly chose 7K images from the training set, as well as all 2K images from validation. In addition, they use a word vocabulary size of 23,115.<br />
<br />
=== Phrase-based captioning model ===<br />
The authors analyze different instantiations of their phrase-based captioning in the following table. To sanity check, they compare it to a flat approach (word-RNN only). Overall, their model performs slightly worse (0.66 points). However, the main strength of their model is that it allows a more natural integration with feedback.<br />
<br />
[[File:table2.JPG|center]]<br />
<br />
=== Feedback network ===<br />
The authors use 9000 images to collect feedback; 5150 of them are evaluated as containing errors. Finally, they use 4174 images for the second round of annotation. They randomly select 9/10 of them to serve as a training set for feedback network, and 1/10 of them to be the test set. The model achieves the highest accuracy of 74.66% when they provide it with the kind of mistake the reference caption had (e.g. an object, action, etc). This is not particularly surprising as it requires the most additional information to train the model and the most time to compile the dataset for.<br />
<br />
=== RL with Natural Language Feedback ===<br />
The following table reports the performance of several instantiations of the RL models. All models have been pre-trained using cross-entropy loss (MLE) on the full MS-COCO training set. For the next rounds of training, all the models are trained only on the 9K images.<br />
<br />
The authors define “C” captions as all captions that were corrected by the annotators and were not evaluated as containing minor or major error, and ground-truth captions for the rest of the images. For “A”, they use all captions (including captions which were evaluated as correct) that did not have minor or major errors, and GT for the rest. A detailed breakdown of these captions is reported in the following table. The authors test their model in two separate cases:<br />
<br />
*They first test a model using the standard cross-entropy loss, but which now also has access to the corrected captions in addition to the 5GT captions. This model (MLEC) is able to improve over the original MLE model by 1.4 points. They then test the RL model by optimizing the metric wrt the 5GT captions. This brings an additional point, achieving 2.4 over the MLE model. Next, the RL agent is given access to 3GT captions, the “C" captions and feedback sentences. They show that this model outperforms the no-feedback baseline by 0.5 points. If the RL agent has access to 4GT captions and feedback descriptions, a total of 1.1 points over the baseline RL model and 3.5 over the MLE model will be achieved. <br />
<br />
*They also test a more realistic scenario, in which the models have access to either a single GT caption, “C" (or “A”), and feedback. This mimics a scenario in which the human teacher observes the agent and either gives feedback about the agent’s mistakes, or, if the agent’s caption is completely wrong, the teacher writes a new caption. Interestingly, RL, when provided with the corrected captions, performs better than when given GT captions. Overall, their model outperforms the base RL (no feedback) by 1.2 points.<br />
<br />
[[File:table3.PNG|center]]<br />
<br />
These experiments make an important point. Instead of giving the RL agent a completely new target (caption), a better strategy is to “teach” the agent about the mistakes it is doing and suggest a correction. This is not very difficult to understand intuitively - informing the agent of its error indeed conveys more information than teaching it a completely correct answer, because the latter forces the network to "train" its memory from a sample which is, at least seemingly, insulated from its prior memory.<br />
<br />
= Conclusion =<br />
In this paper, a human teacher is enabled to provide feedback to the learning agent in the form of natural language. The authors focused on the problem of image captioning. They proposed a hierarchical phrase-based RNN as the captioning model, which allowed natural integration with human feedback.<br />
They also crowd-sourced feedback and showed how to incorporate it in policy gradient optimization.<br />
<br />
= Critique =<br />
Authors of this paper consider both grand truth caption and feedback as the same source of information. However, feedback seems to contain much richer information than grand truth caption. So, maybe that's why RLF (reinforcement learning with feedback network) outperforms other models in experimental results. In addition, the hierarchical phrased based image captioning model does not outperform the baseline; thus, maybe that's a good idea to follow the same idea using existing image captioning models and then apply parsing techniques and compare the results. Another interesting future work one can do is to incorporate confidence score of feedback network in the learning process in order to emphasize strong feedbacks.<br />
<br />
= Comments =<br />
In the hierarchical phrase-based RNN, human involving is a key part of improving the performance of the network. According to this paper, the feedback LSTTM network is capable of handling simple sentences. What if the feedback is weak or even ambiguous? Is there a threshold for the feedback such that the network can refuse a wrong feedback? Follow this architecture, it would be interesting to see whether such feedback strategy can be applied in machine translation.<br />
<br />
My intuition says that this architecture is particularly well suited for handling sparse bad feedbacks due to the very nature of RNNs (robustness) and provides a solid foundation of feedback-augmented learning networks.<br />
<br />
= References=<br />
[1] Huan Ling and Sanja Fidler. Teaching Machines to Describe Images via Natural Language Feedback. In arXiv:1706.00130, 2017.<br />
<br />
[2] Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles L. Isbell, and Andrea Lockerd Thomaz. Policy shaping: Integrating human feedback with reinforcement learning. In NIPS, 2013. <br />
<br />
[3] K. Judah, S. Roy, A. Fern, and T. Dietterich. Reinforcement learning via practice and critique advice. In AAAI, 2010. <br />
<br />
[4] A. Thomaz and C. Breazeal. Reinforcement learning with human teachers: Evidence of feedback and guidance. In AAAI, 2006. <br />
<br />
[5] Richard Maclin and Jude W. Shavlik. Incorporating advice into agents that learn from reinforcements. In National Conference on Artificial Intelligence, pages 694–699, 1994. <br />
<br />
[6] Jason Weston. Dialog-based language learning. In arXiv:1604.06045, 2016. <br />
<br />
[7] Russell Kaplan, Christopher Sauer, and Alexander Sosa. Beating atari with natural language guided reinforcement learning. In arXiv:1704.05539, 2017.<br />
<br />
[8] Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, pages 278–287, 1999.<br />
<br />
[9] COCO Dataset http://cocodataset.org/#home<br />
<br />
[10] https://en.wikipedia.org/wiki/BLEU<br />
<br />
[11] Papineni, K., Roukos, S., Ward, T., Henderson, J and Reeder, F. (2002). “Corpus-based Comprehensive and Diagnostic MT Evaluation: Initial Arabic, Chinese, French, and Spanish Results” in Proceedings of Human Language Technology 2002, San Diego, pp. 132–137<br />
<br />
[12] Callison-Burch, C., Osborne, M. and Koehn, P. (2006) "Re-evaluating the Role of BLEU in Machine Translation Research" in 11th Conference of the European Chapter of the Association for Computational Linguistics: EACL 2006 pp. 249–256<br />
<br />
[13] Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, Kevin Murphy. "Improved Image Captioning via Policy Gradient optimization of SPIDEr". Under review for ICCV 2017.<br />
<br />
[14] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille. "Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)".arXiv:1412.6632<br />
<br />
[15] W. Bradley Knox and Peter Stone. Reinforcement learning from simultaneous human and mdp reward. In Intl. Conf. on Autonomous Agents and Multiagent Systems, 2012.<br />
<br />
= Appendix = <br />
<br />
===C Examples of Collected Feedback ===<br />
The authors provide examples of feedback collected for the reference (MLE) model <br />
<br />
[[File:appendix_c.PNG]]</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30417Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:51:52Z<p>Amanjhunjhunwala: /* Zero-shot and Adaptation Learning */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
<br />
[[File:MRL9.png|border|center|800px]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
[[File:MRL10.png|border|right|400px]]<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|border|left|320px]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30416Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:51:30Z<p>Amanjhunjhunwala: /* Ablations */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
<br />
[[File:MRL9.png|border|center|800px]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
[[File:MRL10.png|border|right|400px]]<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|border|right|300px]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30415Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:51:20Z<p>Amanjhunjhunwala: /* Multitask Learning */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
<br />
[[File:MRL9.png|border|center|800px]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|border|right|300px]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30414Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:50:46Z<p>Amanjhunjhunwala: /* Multitask Learning */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
<br />
[[File:MRL9.png|border|center|800px]]<br />
[[File:MRL10.png|border|right|400px]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|border|right|300px]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30413Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:50:22Z<p>Amanjhunjhunwala: /* Environments */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|border|right|400px]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|border|right|300px]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30412Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:49:59Z<p>Amanjhunjhunwala: /* Multitask Learning */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
[[File:MRL9.png|border|right|800px]]<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|border|right|400px]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|border|right|300px]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30411Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:49:45Z<p>Amanjhunjhunwala: /* Multitask Learning */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
[[File:MRL9.png|border|right|800px]]<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|border|right|450px]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|border|right|300px]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30410Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:49:32Z<p>Amanjhunjhunwala: /* Environments */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
[[File:MRL9.png|border|right|800px]]<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|border|right|350px]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|border|right|300px]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30409Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:49:11Z<p>Amanjhunjhunwala: /* Environments */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
[[File:MRL9.png|border|left|800px]]<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|border|right|350px]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|border|right|300px]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30408Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:49:00Z<p>Amanjhunjhunwala: /* Environments */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
[[File:MRL9.png|border|left|600px]]<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|border|right|350px]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|border|right|300px]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30407Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:48:32Z<p>Amanjhunjhunwala: /* Zero-shot and Adaptation Learning */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL9.png|border|left|400px]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|border|right|350px]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|border|right|300px]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30406Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:48:11Z<p>Amanjhunjhunwala: /* Multitask Learning */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL9.png|border|left|400px]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|border|right|350px]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|frame]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30405Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:47:28Z<p>Amanjhunjhunwala: /* Environments */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL9.png|border|left|400px]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|frame]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|frame]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30404Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:47:15Z<p>Amanjhunjhunwala: /* Experiments */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL0.png|border|left|400px]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|frame]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|frame]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30403Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:46:39Z<p>Amanjhunjhunwala: /* Experiments */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL9.png|frame]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|frame]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|frame]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30402Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:45:48Z<p>Amanjhunjhunwala: /* Introduction & Background */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|frame]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL9.png|frame]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|frame]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|frame]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30401Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:45:27Z<p>Amanjhunjhunwala: /* Introduction & Background */</p>
<hr />
<div>='''Introduction & Background'''=<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
[[File:MRL0.png|border|right|400px]]<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|frame]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL9.png|frame]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|frame]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|frame]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30400Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:44:14Z<p>Amanjhunjhunwala: /* Introduction & Background */</p>
<hr />
<div>='''Introduction & Background'''=<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
[[File:MRL0.png|border|350px]]<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|frame]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL9.png|frame]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|frame]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|frame]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30399Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:44:04Z<p>Amanjhunjhunwala: /* Introduction & Background */</p>
<hr />
<div>='''Introduction & Background'''=<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
[[File:MRL0.png|border|75px]]<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|frame]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL9.png|frame]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|frame]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|frame]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30398Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:43:50Z<p>Amanjhunjhunwala: /* Introduction & Background */</p>
<hr />
<div>='''Introduction & Background'''=<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
[[File:MRL0.png|border|10px]]<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|frame]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL9.png|frame]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|frame]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|frame]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30397Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:43:16Z<p>Amanjhunjhunwala: /* Introduction & Background */</p>
<hr />
<div>='''Introduction & Background'''=<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
[[File:MRL0.png|frame|10px]]<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|frame]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL9.png|frame]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|frame]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|frame]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30396Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:43:04Z<p>Amanjhunjhunwala: /* Introduction & Background */</p>
<hr />
<div>='''Introduction & Background'''=<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
[[File:MRL0.png|frame|50px]]<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|frame]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL9.png|frame]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|frame]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|frame]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30395Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:40:12Z<p>Amanjhunjhunwala: /* Zero-shot and Adaptation Learning */</p>
<hr />
<div>='''Introduction & Background'''=<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
[[File:MRL0.png|frame]]<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|frame]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL9.png|frame]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|frame]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|frame]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided leaving the model to learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30394Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:35:47Z<p>Amanjhunjhunwala: /* Experiments */</p>
<hr />
<div>='''Introduction & Background'''=<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
[[File:MRL0.png|frame]]<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
[[File:MRL8.png|frame]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL9.png|frame]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|frame]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|frame]]<br />
In the final experiments, the model’s ability to generalize beyond the standard training condition is examined. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided and the model must learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30393Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:35:06Z<p>Amanjhunjhunwala: /* Environments */</p>
<hr />
<div>='''Introduction & Background'''=<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
[[File:MRL0.png|frame]]<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL8.png|frame|20x20px]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|frame]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|frame]]<br />
In the final experiments, the model’s ability to generalize beyond the standard training condition is examined. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided and the model must learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30392Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:34:55Z<p>Amanjhunjhunwala: /* Environments */</p>
<hr />
<div>='''Introduction & Background'''=<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
[[File:MRL0.png|frame]]<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL8.png|frame|50x50px]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|frame]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|frame]]<br />
In the final experiments, the model’s ability to generalize beyond the standard training condition is examined. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided and the model must learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30391Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:33:21Z<p>Amanjhunjhunwala: /* Multitask Learning */</p>
<hr />
<div>='''Introduction & Background'''=<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
[[File:MRL0.png|frame]]<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL8.png|frame]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|frame]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for baselines and the modular model are shown in Figure 4. It can be seen that in all environments, our approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. It can further be seen in Figure 4c that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
In addition to the overall modular parameter tying structure induced by sketches, the other critical component of the training procedure is the decoupled critic and the curriculum. The next experiments investigate the extent to which these are necessary for good performance.<br />
<br />
To evaluate the the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalised advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence independently improve this performance; combining them gives the best result.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|frame]]<br />
In the final experiments, the model’s ability to generalize beyond the standard training condition is examined. Consider two tests of generalization: a zero-shot setting, in which the model is provided a sketch for the new task and must immediately achieve good performance, and a adaptation setting, in which no sketch is provided and the model must learn the form of a suitable sketch via interaction in the new task.They hold out two length-four tasks from the full inventory used in Section 4.3, and train on the remaining tasks. For zero-shot experiments, the concatenated policy is formed to describe the sketches of the held-out tasks, and repeatedly executing this policy (without learning) in order to obtain an estimate of its effectiveness. For adaptation experiments, consider ordinary RL over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in Section 3.1. Results are shown in Table 1. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
They have described an approach for multitask learning of deep multitask policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalisation when further sketches are available, and hierarchical reinforcement learning when they are not.<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Work-shop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Be-tween MDPs and semi-MDPs: A framework for tempo-ral abstraction in reinforcement learning. Artificial intel-ligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJ-CAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforce-ment learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for pro-grammable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advance-ment of Artificial Intelligence, 2002.</div>Amanjhunjhunwalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Modular_Multitask_Reinforcement_Learning_with_Policy_Sketches&diff=30390Modular Multitask Reinforcement Learning with Policy Sketches2017-11-15T04:32:54Z<p>Amanjhunjhunwala: </p>
<hr />
<div>='''Introduction & Background'''=<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
[[File:MRL0.png|frame]]<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task. It describe its components, as shown in Figure 1. While symbols might be shared across different tasks ( the predicate "get wood" appears in sketches for both the tasks : "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function, and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be are easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In a adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing for each symbol ''b'' a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented state space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state onto a distribution over $A^+$ will work fine with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a sub-policy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to bi+1 . We may thus think of as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Assume each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
<br />
==Policy Optimization==<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In the case of generalizing to modular policies built by sequencing sub-policies the authors suggest to have one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$ . Thus individual subpolicies are not uniquely identified or differentiated with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per-sample (per-task-and-timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here it is allowed that these $c_t$ to be implemented with an arbitrary function approximator with parameters $\eta_t$ . This is trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t (s_i) + \gamma V_t(s_{i+1} - V_t(s_i))$ or any other member of the generalized advantage estimator family) can be used to substitute by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size D, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that exhibit the desired structural properties or no longer generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$ . Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task T. Then at each timestep, tasks are sampled in proportion $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
<br />
='''Experiments'''=<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment, and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000, and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
<br />
==Environments==<br />
[[File:MRL8.png|frame]]<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
The maze environment is very similar to “light world” described by [4]. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors. The agent needs to first pick up a key to open them. For our experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
[[File:MRL10.png|frame]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically,they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task / subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features an