human-level control through deep reinforcement learning

From statwiki
Revision as of 21:49, 29 October 2015 by Alcateri (talk | contribs) (Training Details)
Jump to: navigation, search

Introduction

Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. "Reinforcement Learning." Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a desirable reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.

When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a similar structure to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able replicate the success of humans, as they have only performed well on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. TD-Gammon, a self-teaching backgammon program, achieves master's play. AAAI Techinical Report (1993)</ref>.

In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. "Human-level control through deep reinforcement learning." Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.

Problem Description

The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function

[math]Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right][/math]

which is the maximum cumulative sum of rewards [math]r_t\,[/math] discounted by [math]\,\gamma[/math] at each timestep [math]t\,[/math]. This sum can be achieved from a policy [math]\pi = P\left(a|s\right)[/math] after making an observation [math]\,s[/math] and taking an action [math]\,a[/math] <ref name = "main"></ref>.

Instability of Neural Networks as Function Estimate

Unfortunately, current methods which use deep networks to estimate [math]Q\,[/math]suffer from instability or divergence for the following reasons:

  1. Correlation within sequence of observations
  2. Small updates to [math]Q\,[/math]can significantly change the policy, and thus the data distribution
  3. The action values [math]Q\,[/math]are correlated with the target values [math]\,y = r_t + \gamma \max_{a'}Q(s', a')[/math]

Overcoming Instability

One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing [math]e_t = \left(s_t, a_t, r_t, s_{t+1}\right)[/math] - known as the "experiences" - at each time step in a dataset [math]D_t = \left(e_1, e_2, \ldots, e_t\right)[/math]. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from [math]\,D_t[/math]. In practice, only [math]N[/math] experiences are stored, where [math]N[/math] is some large, finite number (e.g. [math]N=10^6[/math]). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past [math]N[/math] states. This makes it much more unlikely that instability or divergence will occur.

Another method used to combat instability is to use a separate network for generating the targets [math]y_i[/math] as opposed to the same network. This is implemented by cloning the network [math]\,Q[/math] every [math]\,C[/math] iterations, and using this static, cloned network to generate the next [math]\,C[/math] target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, [math]\,C = 10^4[/math].

Data & Preprocessing

The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 images with colour, so some preprocessing is performed to simplify the data. The first step to encode a single frame is to take the maximum value for each pixel colour value over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, as sometimes images are only shown on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the [math]m[/math] most recent frames (here [math]m=4[/math]), and these 84x84x4 images are the inputs to the network.

Model Architecture

The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards, [math]Q[/math], is estimated by a deep convolutional network, and is updated at every step in time.

The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state. The first hidden layer convolves 32 8x8 filters with stride 4, then applies a rectified nonlinear function <ref>Jarrett K. et. al. What is the best multi-stage architecture for object recognition? Proc. IEEE Int. Conf. Comput. Vis. 2146-2153 (2009)</ref>. In the next hidden layer, 64 4x4 filters with stride 2 are convolved and again followed by rectified non-linearity. The next layer is the final convolutional layer, with 64 3x3 filters of stride 1, followed by the rectifier. The final hidden layer in the network is fully-connected, with 512 rectifying units. The output layer is a fully-connected linear layer with a single output for each valid action, of which there ranged from 4 to 18 in any particular game<ref name = "main"></ref>.

Network Architecture.JPG

Training Details

Forty-nine Atari games were considered as experiments. A unique DQN is trained for each game, but the same structure, algorithm, and global parameters (e.g. [math]C[/math] or [math]m[/math] as above, among others) were used throughout. The value of the global parameters was selected by performing an informal search on a small subset of the 49 games. The reward structure for games was slightly changed, clipping negative rewards at -1 and positive rewards at 1, since score varies from game to game. This could be problematic since the agent may not properly prioritize higher-scoring actions, but it also helps stabilize the network and allows it to generalize to more games. However, the game itself is not otherwise changed.

There is also a frame-skipping technique employed, in which the agent only performs an action every [math]k^{th}[/math] frame to allow the agent to play [math]k[/math] times more games, as the network does not have to be trained on the skipped frames ([math]k=4[/math] here). Furthermore, I believe this creates a more realistic experience for the agent, as human players would not be able to change their own actions every single frame.

The agents are trained on 50 million frames of game play, which is about 38 days. The RMSProp <ref>Hinton, Geoffrey et al.Overview of Minibatch Gradient Descent. University of Toronto.</ref> algorithm, which performs stochastic gradient descent in small batches, is used to train the network.

Results

Bibliography

<references />