human-level control through deep reinforcement learning

Introduction

Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. "Reinforcement Learning." Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a desirable reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.

When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using dopamine-based neural systems whose structure is similar to that of reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have not been able to replicate the success of humans, as they have only performed well on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. TD-Gammon, a self-teaching backgammon program, achieves master's play. AAAI Technical Report (1993)</ref>.

In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et al. "Human-level control through deep reinforcement learning." Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.

Methodology

Problem Description

The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network is used to approximate the so-called optimal action-value function

$Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]$

which is the maximum expected sum of rewards $r_t\,$, discounted by $\,\gamma$ at each timestep $t\,$. This maximum is achievable by a policy $\pi = P\left(a|s\right)$ after making an observation $\,s$ and taking an action $\,a$ <ref name = "main"></ref>.
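As a small illustration of the quantity inside the expectation, the following sketch computes the discounted sum $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots$ for one finite sequence of rewards. The function name `discounted_return` and the example rewards are assumptions made for illustration; they do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Return r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for a finite list."""
    total = 0.0
    # Work backwards through the rewards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical rewards observed over three timesteps.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```

$Q^*$ is the expectation of this sum, maximized over policies, so a single trajectory such as the one above only provides one sample of it.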

Instability of Neural Networks as Function Estimate

Unfortunately, current methods which use deep networks to estimate $Q\,$ suffer from instability or divergence for the following reasons:

1. Correlation within the sequence of observations
2. Small updates to $Q\,$ can significantly change the policy, and thus the data distribution
3. The action values $Q\,$ are correlated with the target values $\,y = r_t + \gamma \max_{a'}Q(s', a')$

Overcoming Instability

One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing $e_t = \left(s_t, a_t, r_t, s_{t+1}\right)$ - known as the "experiences" - at each time step in a dataset $D_t = \left(e_1, e_2, \ldots, e_t\right)$. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from $\,D_t$. In practice, only the $N$ most recent experiences are stored, where $N$ is some large, finite number (e.g. $N=10^6$). Incorporating experience replay into the algorithm breaks up the correlation between consecutive observations and smooths out the data distribution, since each update averages over samples drawn from many past states. This makes it much less likely that instability or divergence will occur.
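A minimal sketch of such a replay memory is given below, assuming a fixed capacity $N$ and uniform sampling. The class name `ReplayBuffer`, its method names, and the choice to sample without replacement are illustrative assumptions rather than details taken from the paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Stores the N most recent experiences e_t = (s_t, a_t, r_t, s_{t+1})."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling (here without replacement) over stored experiences.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Toy usage: capacity 3, so only the last three of five experiences survive.
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.store(state=t, action=0, reward=1.0, next_state=t + 1)
minibatch = buffer.sample(2)
```

Because the minibatch is drawn at random from the whole buffer, consecutive training samples generally come from different points in time, which is the source of the decorrelation described above.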

Another method used to combat instability is to use a separate network for generating the targets $y_i$ as opposed to the same network. This is implemented by cloning the network $\,Q$ every $\,C$ iterations, and using this static, cloned network to generate the next $\,C$ target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, $\,C = 10^4$.
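The cloning schedule can be sketched as follows. To keep the example self-contained, a dictionary of action values stands in for the network, the clone period is set to $C = 2$ instead of $10^4$, and the "learning update" is a fixed increment. All of these, along with the names `td_target`, `q` and `target_q`, are assumptions made for illustration.

```python
import copy


def td_target(reward, next_state, target_q, actions, gamma):
    """y = r + gamma * max_a' Q_target(s', a'), computed from the frozen clone."""
    return reward + gamma * max(target_q[(next_state, a)] for a in actions)


actions = [0, 1]
# Stand-in for the online network Q: a table of action values, all zero.
q = {(s, a): 0.0 for s in range(3) for a in actions}
target_q = copy.deepcopy(q)  # frozen clone used to generate targets
C = 2                        # clone period (tiny here, for illustration only)

targets = []
for step in range(1, 5):
    # Targets always come from the clone, never from the online values.
    targets.append(td_target(1.0, 1, target_q, actions, gamma=0.5))
    q[(1, 0)] += 1.0  # placeholder for a learning update to the online values
    if step % C == 0:
        target_q = copy.deepcopy(q)  # refresh the clone every C steps

print(targets)  # [1.0, 1.0, 2.0, 2.0]
```

The targets stay constant between clones even though the online values change at every step, which illustrates how the target values are decoupled from the most recent updates.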

Data & Preprocessing

The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 images with colour, so some preprocessing is performed to simplify the data. The first step to encode a single frame is to take the maximum value for each pixel colour value over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, as sometimes images are only shown on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the $m$ most recent frames (here $m=4$), and these are the inputs to the network.
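The preprocessing steps above can be sketched in plain Python as follows. Frames are represented as nested lists of RGB tuples. The luma weights used for the greyscale conversion and the nearest-neighbour downsampling are assumptions, since this summary does not specify either, and all function names are hypothetical.

```python
def max_over_frames(frame, previous):
    """Per-pixel, per-channel maximum of two RGB frames (removes flicker)."""
    return [
        [tuple(max(c, p) for c, p in zip(px, prev_px))
         for px, prev_px in zip(row, prev_row)]
        for row, prev_row in zip(frame, previous)
    ]


def to_greyscale(frame):
    # Assumption: standard luma weights for the colour-to-grey conversion.
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]


def downsample(image, out_h=84, out_w=84):
    # Assumption: nearest-neighbour resampling to the target size.
    in_h, in_w = len(image), len(image[0])
    return [
        [image[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def preprocess(frame, previous):
    return downsample(to_greyscale(max_over_frames(frame, previous)))


def blank_frame(value, h=210, w=160):
    """Hypothetical 210 x 160 test frame with every channel set to `value`."""
    return [[(value, value, value)] * w for _ in range(h)]


# Five toy frames in which the content "flickers" on alternate frames.
frames = [blank_frame(v) for v in (0, 10, 0, 20, 0)]
# Encode each frame together with its predecessor; keep the m = 4 most recent.
state = [preprocess(frames[k], frames[k - 1]) for k in range(1, 5)]
```

Here `state` is a stack of four 84 x 84 greyscale images. The frames that are blank on their own still carry the content of their predecessor after the maximum is taken.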

Model Architecture

The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards ($Q\,$) is estimated by a deep convolutional network, and is updated at every step in time.

The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state.
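Since the network produces one Q-value per action, choosing the action with the highest predicted value requires only a scan over the output vector. The sketch below assumes the outputs are available as a plain list; the function name `greedy_action` and the example values are hypothetical.

```python
def greedy_action(q_values):
    """Index of the largest predicted Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical network outputs for a game with four possible actions.
q_values = [0.1, 1.7, -0.3, 0.9]
action = greedy_action(q_values)
print(action)  # 1
```

With this layout, a single pass through the network yields the values of all actions for the given state, rather than one pass per action.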

<references />