# Human-level control through deep reinforcement learning

## Introduction

Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. "Reinforcement Learning." Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a desirable reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.

When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations of high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. Thus, the models need to be capable of dealing with large volumes of data. The human brain is adept at this type of learning, using systems of dopamine neurons whose structure resembles that of reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have been able to replicate the success of humans only in limited settings, as they have performed well only on fully observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. TD-Gammon, a self-teaching backgammon program, achieves master-level play. AAAI Technical Report (1993)</ref>.

In the paper summarized below <ref name = "main">Mnih, Volodymyr et al. "Human-level control through deep reinforcement learning." Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The DQN is tested on playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results on this task, as we will see.

## Methodology

### Problem Description

The goal of this research is to create a framework that can excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network is used to approximate the so-called optimal action-value function

$Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right]$

which is the maximum expected cumulative sum of rewards $r_t$, discounted by a factor $\gamma$ at each timestep $t$, that can be achieved by a policy $\pi = P\left(a|s\right)$ after making an observation $s$ and taking an action $a$ <ref name = "main"></ref>.
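As a small illustration of the discounted sum inside the expectation, the following sketch computes $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots$ for one finite sequence of observed rewards. The function name `discounted_return` and the example rewards are hypothetical and do not come from the paper; $Q^*$ itself is the maximum over policies of the expectation of this quantity, which the sketch does not attempt to compute.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite reward list."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma
    # to everything that comes later in the sequence.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: three rewards of 1 with gamma = 0.5
# gives 1 + 0.5 * 1 + 0.25 * 1.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

A smaller $\gamma$ makes the agent weigh immediate rewards more heavily than distant ones.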

#### Instability of Neural Networks as Function Estimators

Unfortunately, current methods that use deep networks to estimate $Q$ suffer from instability or divergence for the following reasons:

1. Correlations within the sequence of observations
2. Small updates to $Q$ can significantly change the policy, and thus the data distribution
3. The action values $Q$ are correlated with the target values $y = r_t + \gamma \max_{a'}Q(s', a')$
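To make the target value $y$ in the third point concrete, here is a minimal sketch for a single transition. It assumes the $Q$ estimates for every action in the next state $s'$ are already available as a plain list; the function name `td_target` and the `terminal` flag are illustrative assumptions, not notation taken from this summary.

```python
def td_target(reward, next_q_values, gamma, terminal=False):
    """Compute y = r + gamma * max_a' Q(s', a') for one transition."""
    if terminal:
        # Assumption: once an episode has ended there is no future
        # reward to bootstrap from, so the target is the reward alone.
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical Q estimates for three actions in the next state s'.
y = td_target(reward=1.0, next_q_values=[0.5, 2.0, -1.0], gamma=0.9)
```

Because $y$ is built from the same $Q$ estimates that are being updated, every update also moves the target, which is the correlation described in the third point.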

#### Overcoming Instability

One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing $e_t = \left(s_t, a_t, r_t, s_{t+1}\right)$ - known as the "experiences" - at each time step in a dataset $D_t = \left(e_1, e_2, \ldots, e_t\right)$. Then, during each learning update, minibatch updates are performed on experiences sampled uniformly at random from $D_t$. In practice, only the $N$ most recent experiences are stored, where $N$ is some large, finite number (e.g. $N=10^6$). Incorporating experience replay into the algorithm breaks up the correlation between consecutive observations and smooths out the data distribution, since each update averages over many of the past $N$ states. This makes instability or divergence much less likely.
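The storage and sampling scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `ReplayBuffer` and its method names are hypothetical, and states are treated as opaque Python objects.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of experiences e_t = (s_t, a_t, r_t, s_{t+1})."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest experience
        # once more than `capacity` items have been appended.
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored experiences.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Usage with a tiny capacity so that eviction is easy to see.
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.store(state=t, action=0, reward=1.0, next_state=t + 1)
minibatch = buffer.sample(batch_size=2)
```

Copying the deque into a list on every call keeps the sketch short; a buffer holding $10^6$ experiences would more likely use a preallocated array with a write index.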

Another method used to combat instability is to generate the targets $y_i$ with a separate network rather than with the network being trained. This is implemented by cloning the network $Q$ every $C$ updates, and using this static, cloned network to generate the next $C$ target values. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, $C = 10^4$.
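The cloning schedule can be sketched independently of any particular network. In the illustration below, `q_params` stands in for the network weights and `update_fn` for one learning update; both names, and the function `train_with_target_clone`, are hypothetical placeholders rather than anything defined in the paper.

```python
import copy


def train_with_target_clone(q_params, num_updates, clone_every, update_fn):
    """Run updates, refreshing a frozen copy of the parameters every C steps."""
    # The target copy starts as a clone of the initial parameters.
    target_params = copy.deepcopy(q_params)
    for step in range(1, num_updates + 1):
        # update_fn sees the frozen copy, so the targets it builds do not
        # shift with each individual update to q_params.
        q_params = update_fn(q_params, target_params)
        if step % clone_every == 0:
            target_params = copy.deepcopy(q_params)
    return q_params, target_params


# Toy update: increment a counter and record which target copy was visible.
seen_targets = []


def toy_update(q, target):
    seen_targets.append(target["w"])
    return {"w": q["w"] + 1}


final_q, final_target = train_with_target_clone(
    {"w": 0}, num_updates=7, clone_every=3, update_fn=toy_update
)
```

In the toy run the target copy lags behind the online parameters by up to `clone_every` updates, which is the delay that decouples the targets from the values being learned.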

<references />