FeUdal Networks for Hierarchical Reinforcement Learning

Introduction

Although deep reinforcement learning has been hugely successful in a variety of domains, it has struggled in environments where reward signals are sparse. In such environments it faces the major challenge of long-term credit assignment: the agent cannot attribute a reward to an action taken many timesteps earlier.

This paper proposes a hierarchical reinforcement learning (HRL) architecture, called FeUdal Networks (FuN), inspired by Feudal Reinforcement Learning (FRL) (Dayan & Hinton, 1993). It is a fully differentiable neural network with two levels of hierarchy: a Manager module at the top level and a Worker module below it. The Manager sets abstract goals, which are learned, at a lower temporal resolution in a latent state space. The Worker operates at a higher temporal resolution and produces primitive actions at every tick of the environment; an intrinsic reward motivates it to follow the goals received from the Manager.

The key contributions of this paper are: (1) a consistent, end-to-end differentiable HRL architecture inspired by FRL; (2) a novel, approximate transition policy gradient update for training the Manager; (3) the use of goals that are directional rather than absolute in nature; and (4) the dilated LSTM, a novel RNN design for the Manager that allows gradients to flow through large hops in time (a sketch of the idea is given below).
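The dilated LSTM is analogous to dilated convolutions: the recurrent state is split into r groups, and at environment step t only group t mod r is updated, so each group effectively runs at 1/r of the environment's temporal resolution while sharing a single set of LSTM weights. The following is a minimal sketch of this idea, not the paper's exact implementation; the class name, hyperparameters, and the mean pooling of the group states are our own assumptions.

<pre>
import torch
import torch.nn as nn

class DilatedLSTM(nn.Module):
    """Sketch of a dilated LSTM: r state groups share one set of LSTM weights,
    and only group (t mod r) is updated at environment step t."""

    def __init__(self, input_size, hidden_size, r=10):
        super().__init__()
        self.r = r
        self.hidden_size = hidden_size
        self.cell = nn.LSTMCell(input_size, hidden_size)

    def init_state(self, batch_size):
        # One (h, c) pair per dilation group, all initialised to zeros.
        return [(torch.zeros(batch_size, self.hidden_size),
                 torch.zeros(batch_size, self.hidden_size))
                for _ in range(self.r)]

    def forward(self, x_t, state, t):
        idx = t % self.r                      # group active at this tick
        h_new, c_new = self.cell(x_t, state[idx])
        state = list(state)                   # other groups are left untouched
        state[idx] = (h_new, c_new)
        # Pool the hidden states of all groups to form the output
        # (the pooling choice here is an illustrative assumption).
        out = torch.stack([h for h, _ in state]).mean(dim=0)
        return out, state
</pre>

Because each group is touched only every r steps, gradients propagated through a group span r environment ticks per recurrent step, which is what allows them to flow through large hops in time.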

Experiments on several tasks with sparse rewards show that FuN significantly improves long-term credit assignment and memorization.

Related Work

Several hierarchical reinforcement learning models have been proposed to address this problem. The options framework (Sutton et al., 1999; Precup, 2000) considers a two-level hierarchy in which options are typically learned using sub-goals and ‘pseudo-rewards’ that are provided explicitly. The option-critic architecture (Bacon et al., 2017), in contrast, uses the policy gradient theorem to learn options in an end-to-end fashion. A problem with learning options end-to-end is that they tend to collapse to one of two trivial solutions: (i) only one option is active and solves the whole task, or (ii) the policy-over-options changes options at every step, micro-managing the behavior. The authors state that the option-critic architecture is the only other end-to-end trainable system with sub-policies.

Non-hierarchical deep RL methods that use auxiliary losses and rewards, such as pseudo-count-based exploration bonuses (Bellemare et al., 2016a), have significantly improved results by encouraging agents to explore new parts of the state space. The UNREAL agent (Jaderberg et al., 2016) is another non-hierarchical method that showed strong improvements through unsupervised auxiliary tasks.

Model

A high-level explanation of the model is as follows. The Manager computes a latent state representation [math]\displaystyle{ s_t }[/math] and outputs a goal vector [math]\displaystyle{ g_t }[/math]. The Worker outputs actions based on the environment observation, its own state, and the Manager’s goal. A perceptual module computes an intermediate representation [math]\displaystyle{ z_t }[/math] of the environment observation [math]\displaystyle{ x_t }[/math]. The Manager’s goals [math]\displaystyle{ g_t }[/math] are trained using an approximate transition policy gradient. The Worker is trained with an intrinsic reward that encourages it to output actions which achieve the goals set by the Manager.
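The intrinsic reward the paper uses for the Worker is, roughly, the average cosine similarity between recent changes in the latent state and the goals the Manager emitted at the corresponding times, computed over a horizon of c steps. A minimal numpy sketch of that computation (the function names and the handling of the first few steps of an episode are our own assumptions):

<pre>
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def intrinsic_reward(states, goals, t, c=10):
    """Average cosine similarity between the state changes s_t - s_{t-i}
    and the goals g_{t-i} the Manager emitted i steps earlier, i = 1..c.

    states, goals: sequences of latent-state / goal vectors indexed by time.
    """
    horizon = min(c, t)          # early in an episode there is less history
    sims = [cosine_sim(states[t] - states[t - i], goals[t - i])
            for i in range(1, horizon + 1)]
    return sum(sims) / c if sims else 0.0
</pre>

In words: the Worker is rewarded for moving the latent state in the directions the Manager has recently asked for, which is what couples the two levels of the hierarchy without requiring the Worker to see the external reward directly.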

The Manager and the Worker are recurrent networks, with [math]\displaystyle{ {h^M} }[/math] and [math]\displaystyle{ {h^W} }[/math] denoting their internal states. [math]\displaystyle{ \phi }[/math] is a linear transform that maps a goal [math]\displaystyle{ g_t }[/math] into an embedding vector [math]\displaystyle{ w_t \in \mathbb{R}^k }[/math], which is then combined with the matrix [math]\displaystyle{ U_t }[/math] (the Worker’s output) to produce the policy [math]\displaystyle{ \pi }[/math], a vector of probabilities over primitive actions.

File:model equations.jpg
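Written out, following the notation of the original paper, the forward dynamics described above are:

[math]\displaystyle{ z_t = f^{percept}(x_t); \quad s_t = f^{Mspace}(z_t) }[/math]

[math]\displaystyle{ h^M_t, \hat{g}_t = f^{Mrnn}(s_t, h^M_{t-1}); \quad g_t = \hat{g}_t / \lVert \hat{g}_t \rVert }[/math]

[math]\displaystyle{ w_t = \phi \left( \sum_{i=t-c}^{t} g_i \right) }[/math]

[math]\displaystyle{ h^W_t, U_t = f^{Wrnn}(z_t, h^W_{t-1}); \quad \pi_t = \text{SoftMax}(U_t w_t) }[/math]

As a concrete illustration of the last two steps, the following numpy sketch combines the Worker’s action embeddings with the pooled, embedded goals to produce the policy. The matrix shapes, names, and random placeholder inputs are assumptions for illustration, not the paper’s implementation.

<pre>
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def worker_policy(U_t, recent_goals, phi, c=10):
    """Sketch: combine the Worker's action embeddings U_t (num_actions x k)
    with the Manager's last c+1 goals to get a distribution over actions.

    phi: a k x d matrix standing in for the learned linear map (no bias);
    recent_goals: list of goal vectors g_{t-c}, ..., g_t, each of dimension d.
    """
    g_pooled = np.sum(recent_goals[-(c + 1):], axis=0)  # sum_{i=t-c}^{t} g_i
    w_t = phi @ g_pooled                                 # embedding w_t in R^k
    return softmax(U_t @ w_t)                            # pi_t over actions

# Example usage with random placeholders: 6 actions, k = 16, d = 32.
pi = worker_policy(np.random.randn(6, 16),
                   [np.random.randn(32) for _ in range(11)],
                   np.random.randn(16, 32))
</pre>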