Modular Multitask Reinforcement Learning with Policy Sketches: Difference between revisions
Revision as of 15:26, 12 November 2017
Introduction
This paper describes a framework for learning composable deep subpolicies in a multitask setting, guided only by abstract sketches of high-level behavior. General reinforcement learning algorithms allow agents to solve tasks in complex environments. But tasks featuring extremely delayed rewards or other long-term structure are often difficult to solve with flat, monolithic policies, and a long line of prior work has studied methods for learning hierarchical policy representations (Sutton et al., 1999; Dietterich, 2000; Konidaris & Barto, 2007; Hauser et al., 2008). While unsupervised discovery of these hierarchies is possible (Daniel et al., 2012; Bacon & Precup, 2015), practical approaches often require detailed supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives (Precup, 2000). These depend on state representations simple or structured enough that suitable reward signals can be effectively engineered by hand.
But is such fine-grained supervision actually necessary to achieve the full benefits of hierarchy? Specifically, is it necessary to explicitly ground high-level actions into the representation of the environment? Or is it sufficient to simply inform the learner about the abstract structure of policies, without ever specifying how high-level behaviors should make use of primitive percepts or actions?
To answer these questions, we explore a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are short, ungrounded, symbolic representations of a task that describe its component parts, as illustrated in Figure 1. While symbols might be shared across tasks (get wood appears in sketches for both the make planks and make sticks tasks), the learner is told nothing about what these symbols mean, in terms of either observations or intermediate rewards.
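As a rough illustration of what a policy sketch looks like as data, the snippet below stores each task's sketch as an ordered list of symbols and finds the symbols shared across tasks. Only the shared get wood symbol comes from the text; the second symbol in each sketch and the names `SKETCHES` and `shared_symbols` are assumptions made for the example.

```python
from collections import Counter

# Each task maps to its sketch: an ordered list of ungrounded symbols.
# "get wood" being shared is from the text; "use toolshed" and
# "use workbench" are illustrative assumptions.
SKETCHES = {
    "make planks": ["get wood", "use toolshed"],
    "make sticks": ["get wood", "use workbench"],
}


def shared_symbols(sketches):
    """Return the symbols that appear in the sketches of more than one task."""
    counts = Counter(
        symbol for sketch in sketches.values() for symbol in set(sketch)
    )
    return sorted(symbol for symbol, n in counts.items() if n > 1)


print(shared_symbols(SKETCHES))  # ['get wood']
```

Note that the symbols are plain strings: nothing in this structure says what get wood means in terms of observations or rewards, which is the sense in which sketches are ungrounded.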
We present an agent architecture that learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy, and jointly optimizes over concatenated task-specific policies by tying parameters across shared subpolicies. We find that this architecture can use the high-level guidance provided by sketches, without any grounding or concrete definition, to dramatically accelerate learning of complex multi-stage behaviors. Our experiments indicate that many of the benefits to learning that come from highly detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level supervision (i.e. from policy sketches). Crucially, sketches are much easier to produce: they require no modifications to the environment dynamics or reward function, and can be easily provided by non-experts. This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. We show that our approach substantially outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance about how hierarchies should be deployed, and further that the specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.
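The idea of one subpolicy per symbol, with parameters tied across tasks, can be sketched in a few lines. This is a toy under stated assumptions and not the released implementation: the tabular `SubPolicy` class, the `STOP` action that hands control to the next subpolicy, the primitive action names, and the second symbol in each sketch are all hypothetical, and no training step is shown.

```python
import math
import random

STOP = "STOP"  # assumed special action: pass control to the next subpolicy


class SubPolicy:
    """Toy subpolicy: one preference weight per primitive action (plus STOP)."""

    def __init__(self, actions):
        self.prefs = {a: 0.0 for a in [*actions, STOP]}

    def act(self, rng):
        # Sample an action with probability proportional to exp(preference).
        actions = list(self.prefs)
        weights = [math.exp(self.prefs[a]) for a in actions]
        return rng.choices(actions, weights=weights)[0]


def build_task_policies(sketches, actions):
    """Create one SubPolicy per symbol and concatenate them per task."""
    library, policies = {}, {}
    for task, sketch in sketches.items():
        for symbol in sketch:
            if symbol not in library:
                library[symbol] = SubPolicy(actions)
        # Tasks that share a symbol share the same object, i.e. tied parameters.
        policies[task] = [library[symbol] for symbol in sketch]
    return library, policies


def rollout(policy, rng, max_steps=50):
    """Run subpolicies in sketch order, advancing whenever STOP is sampled."""
    trace, index = [], 0
    for _ in range(max_steps):
        if index >= len(policy):
            break
        action = policy[index].act(rng)
        if action == STOP:
            index += 1
        else:
            trace.append((index, action))
    return trace


sketches = {
    "make planks": ["get wood", "use toolshed"],   # second symbols are assumed
    "make sticks": ["get wood", "use workbench"],
}
library, policies = build_task_policies(sketches, ["up", "down", "left", "right", "use"])
print(policies["make planks"][0] is policies["make sticks"][0])  # True
```

Because both task policies hold the same object for get wood, any update made to that subpolicy while training on one task is immediately visible to the other, which is what parameter tying amounts to in this toy.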
The present work may be viewed as an extension of recent approaches for learning compositional deep architectures from structured program descriptors (Andreas et al., 2016; Reed & de Freitas, 2016). Here we focus on learning in interactive environments. This extension presents a variety of technical challenges, requiring analogues of these methods that can be trained from sparse, non-differentiable reward signals without demonstrations of desired system behavior.
Our contributions are:
A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.
A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.
The modular structure of our approach, which associates every high-level action symbol with a discrete subpolicy, naturally induces a library of interpretable policy fragments that are easily recombined. This makes it possible to evaluate our approach under a variety of different data conditions: (1) learning the full collection of tasks jointly via reinforcement, (2) in a zero-shot setting where a policy sketch is available for a held-out task, and (3) in an adaptation setting, where sketches are hidden and the agent must learn to adapt a pretrained policy to reuse high-level actions in a new task. In all cases, our approach substantially outperforms previous approaches based on explicit decomposition of the Q function along subtasks (Parr & Russell, 1998; Vogel & Jurafsky, 2010), unsupervised option discovery (Bacon & Precup, 2015), and several standard policy gradient baselines.
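The zero-shot condition (2) follows directly from having a library keyed by symbol: a policy for a held-out task can be assembled by looking up each symbol in its sketch. The snippet below is a minimal sketch of that lookup only. The function name `compose_zero_shot`, the held-out sketch, and the string placeholders standing in for trained subpolicies are all hypothetical.

```python
def compose_zero_shot(library, sketch):
    """Assemble a policy for a held-out sketch from already-trained subpolicies.

    Raises KeyError if the sketch uses a symbol with no trained subpolicy,
    since such a task cannot be attempted without further learning.
    """
    missing = [symbol for symbol in sketch if symbol not in library]
    if missing:
        raise KeyError(f"no trained subpolicy for: {missing}")
    return [library[symbol] for symbol in sketch]


# Strings stand in for trained subpolicy objects (assumption for illustration).
library = {
    "get wood": "pi_get_wood",
    "use toolshed": "pi_use_toolshed",
    "use workbench": "pi_use_workbench",
}
held_out_sketch = ["get wood", "use workbench", "use toolshed"]  # hypothetical task
print(compose_zero_shot(library, held_out_sketch))
```

The adaptation condition (3) differs in that no sketch is given, so the order in which library entries are invoked has to be learned rather than looked up.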
We consider three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the proper order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors; and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. We have released code at http://github.com/jacobandreas/psketch.
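A back-of-the-envelope calculation gives some intuition for why a random policy essentially never sees a reward that arrives only at the end. The numbers below are not from the paper: the example assumes a uniformly random policy, a single correct sequence of primitive actions of fixed length, and hypothetical values for the action count and sequence length. Real environments admit many successful trajectories, so this is a deliberately crude picture.

```python
def chance_of_exact_sequence(num_actions, length):
    """Probability that a uniformly random policy emits one specific
    sequence of `length` primitive actions, chosen from `num_actions`."""
    return (1.0 / num_actions) ** length


# Hypothetical values: 5 primitive actions, 20 steps needed to reach the goal.
chance = chance_of_exact_sequence(num_actions=5, length=20)
print(f"{chance:.3e}")
```

The probability shrinks exponentially with the number of steps, which is why reusing subpolicies learned on shorter tasks matters for the longest ones.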