# Introduction

An interesting research area in reinforcement learning is developing intelligent agents for playing video games. Before the introduction of deep learning, video game agents were commonly coded based on Monte-Carlo Tree Search of pre-set rules. Recent research has shown deep reinforcement learning to be very successful at playing video games like Atari 2600. To be specific, the method (Figure 1) is called Deep Q-Learning (DQN) which learns the optimal actions based on current observations (raw pixels) #References. However, there are some complex games where DQN fails to learn: some games need to solve a sub-problem without explicit reward or contain irreversible domains, where actions can be catastrophic. A typical example of these games is Sokoban(Wikipedia). Even as humans are playing the game, planning and inference are needed. This kind of game raises challenges to RL.

Figure 1: Deep Q-Learning Architecture

In Reinforcement Learning, the algorithms can be divided into two categories: model-free algorithm and model-based algorithm. The model-based reinforcement learning tries to infer environment to gain the reward while model-free reinforcement learning does not use the environment to learn the action that results in the best reward. DQN, mentioned above(Figure 1), is a model-free method. It takes raw pixels as input and maps them to values or actions. As a drawback, large amounts of training data is required. In addition, the policies are not generalized to new tasks in the same environment. A model-based method is trying to build a model for the environment. By querying the model, agents can avoid irreversible, poor decisions. As an approximation of the environment, it can enable better generalization across states. However, this method only shows success in limited settings, where an exact transition model is given or in simple domains. In complex environments, model-based methods suffer from model errors from function approximation. Currently, there is no model-based method that is robust against imperfections.

In this paper, the authors introduce a novel deep reinforcement learning architecture called Imagination-Augmented Agents(I2As). Literally, this method enables agents to learn to interpret predictions from a learned environment model to construct implicit plans. It is a combination of model-free and model-based aspects. The advantage of this method is that it learns in an end-to-end way to extract information from model simulations without making any assumptions about the structure or the perfections of the environment model. As shown in the results, this method outperforms DQN in the games: Sokoban, and MniniPacman. In addition, the experiments all show that I2A is able to successfully use imperfect models.

# Motivation

Although the structure of this method is complex, the motivation is intuitive: since the agent suffers from irreversible decisions, attempts in simulated stated may be helpful. To improve the expensive search space in traditional MCTS methods, adding decision from policy network can reduce search steps. In order to keep context information, rollout results are encoded by an LSTM encoder. The final output is combining the result from model-free network and model-based network.

# Related Work

There are some works that try to apply deep learning to model-based reinforcement learning. The popular approach is to learn a neural network from the environment and apply the network in classical planning algorithms. These works can not handle the mismatch between the learned model and the ground truth. Liu et al.(2017) use context information from trajectories, but in terms of imitation learning.

To deal with imperfect models, Deisenroth and Rasmussen(2011) try to capture model uncertainty by applying high-computational Gaussian Process models. In order to develop such a policy search method, the authors of this paper used analytic gradients of an approximation to the expected return for indirect policy search. This means by learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, this policy search method can cope with very little data and facilitates learning from scratch in only a few trials.

Similar ideas can be found in a study by Hamrick et al.(2017): they present a neural network that queries expert models, but just focus on meta-control for continuous contextual bandit problems. Pascanu et al.(2017) extend this work by focusing on explicit planning in sequential environments.

# Approach

The summary of the architecture of I2A can be seen in Figure 2.

Figure 2: The Architecture of I2A

The observation $O_t$ (Figure 2 right) is fed into two paths, the model-free path is just common DQN which predicts the best action given $O_t$, whereas the model-based path performs a rollout strategy, the aggregator combines the $n$ rollout encoded outputs($n$ equals to the number of actions in the action space), and forwards the results to next layer. Together they are used to generate a policy function $\pi$ to output an action. In each rollout operation, the imagination core is used to predict the future state and reward.

### Imagination Core

The imagination core(Figure 2 left) is the key role in the model-based path. It consists of two parts: environment model and rollout policy. The former is an approximation of the environment and the latter is used to simulate imagined trajectories, which are interpreted by a neural network and provided as additional context to a policy network.

#### environment model

In order to augment agents with imagination, the method relies on environment models that, given current information, can be queried to make predictions about the future. In this work, the environment model is built based on action-conditional next-step predictors, which receive input contains current observation and current action, and predict the next observation and the next reward(Figure 3).

Figure 3: Environment Model

#### rollout policy

The rollout process is regarded as the simulated trajectories. In this work, the rollout is performed for each possible action in the environment.

A rollout policy $\hat \pi$ is a function that takes current observation $O$ and outputs an action $a$ that potentially leads to maximal reward. In this architecture, the rollout policy can be a DQN network. In the experiment, the rollout pocily$\hat \pi$ is broadcasted and shared. After experiments on the types of rollout policies(random, pre-trained), the authors found the efficient strategy is to distill the policy into a model-free policy, which consists in creating a smaill model-free network $\hat \pi(O_t)$, and adding to the total loss.

Together as the imagination core, these two parts produces $n$ trajectories $\hat \tau_1,...,\hat \tau_n$. Each imagined trajectory $\hat \tau$ is a sequence of features $(\hat f_{t+1},...,\hat f_{t+\tau})$, where $t$ is the current time, $\tau$ the length of rollout, and $\hat f_{t+i}$ the output of the environment model(the predicted observation and reward). In order to guarantee success in imperfections, the architecture does not assume the learned model to be perfect. The output will not only depend on the predicted reward.

### Trajectories Encoder

From the intuition to keep the sequence information in the trajectories, the architecture uses a rollout encoder $\varepsilon$ that processes the imagined rollout as a whole and learns to interpret it(Figure 2 middle). Each trajectory is encoded as a rollout embedding $e_i=\varepsilon(\hat \tau_i)$. Then, the aggregator $A$ combines the rollout embedding s into a single imagination code $c_{ia}=A(e_1,...,e_n)$. In the experiments, the encoder is an LSTM that takes the predicted output from environment model as the input. One observation is that the order of the sequence $\hat f_{t+1}$ to $\hat f_{t+\tau}$ makes relatively little impact on the performance. The encodes mimics the Bellman type backup operations in DQN.

### Model-Free Path

The model-free path contains a network that only takes the current observation as input that generates the potential optimal action. This network can be same as the one in imagination core.

In conclusion, the I2A learns to combine information for two paths, and without the model-based path, I2A simply reduce to a standard model-free network(such as A3C). The imperfect approximation results in a rollout policy with higher entropy, potentially striking a balance between exploration and exploitation.

# Experiment

These following experiments were tested in Sokoban and MiniPacman games. All results are averages taken from top three agents. These agents were trained over 32 to 64 workers, and the network was optimized by RMSprop. As the pre-training strategy, the training data of I2A was pre-generated from trajectories of a partially trained standard model-free agent, the data is also taken into account for the budget. The total number of frames that were needed in pre-training is counted in the later process. Meanwhile, the authors show that the environment model can be reused to solve multiple tasks in the same environment.

In the game Sokoban, the environment is a 10 x 10 grid world. All agents were trained directly on raw pixels(image size 80 x 80 with 3 channels). To make sure the network is not just simply "memorize" all states, the game procedurally generates a new level each episode. Out of 40 million levels generated, less than 0.7% were repeated. Therefore, a good agent should solve the unseen level as well.

The reward settings for reinforcement learning algorithms are as follows:

• Every time step, a penalty of -0.1 is applied to the agent.(encourage agents to finish levels faster)
• Whenever the agent pushes a box on target, it receives a reward of +1.(encourage agents to push boxes onto targets)
• Whenever the agent pushes a box off target, it receives a penalty of -1.(avoid artificial reward loop that would be induced by repeatedly pushing a box off and on target)
• Finishing the level gives the agent a reward of +10 and the level terminates.(strongly reward solving a level)

To show the advantage of I2A, the authors set a model-free standard architecture as a baseline. The architecture is a multi-layer convolutional neural network (CNN), taking the current observation $O_t$ as input, followed by a fully connected (FC) hidden layer. This FC layer feeds into two heads: into an FC layer with one output per action computing the policy logits $\log \pi(a_t|O_t, \theta)$; and into another FC layer with a single output that computes the value function $V(O_t; \theta_v)$.

• for MiniPacman: the CNN has two layers, both with 3x3 kernels, 16 output channels and strides 1 and 2; the following FC layer has 256 units
• for Sokoban: the CNN has three layers with kernel sizes 8x8, 4x4, 3x3, strides of 4, 2, 1 and number of output channels 32, 64, 64; the following FC has 512 units

### Sokoban

The environment model for Sokoban is shown in figure 4

Figure 4: The Sokoban environment model

Besides, to demonstrate the influence of larger architecture in I2A, the authors set a copy-model agent that uses the same architecture of I2A but the environment model is replaced by identical map. This agent is regarded as an I2A agent without imagination.

Figure 5: Sokoban learning curves. Left: training curves of I2A and baselines. Right: I2A training curves for various values of imagination depth

The results are shown in Figure 4(left). I2A agents can solve much more levels compared to common DQN. Also, it far outperforms the copy-model. The authors also trained an I2A where the environment model was predicting no rewards, only observations. This also performed worse. However, after much longer training (3e9 steps), these agents did recover the performance of the original I2A, which was never the case for the baseline agent even with that many steps. Hence, reward prediction is helpful but not absolutely necessary in this task, and imagined observations alone are informative enough to obtain high performance on Sokoban. Note this is in contrast to many classical planning and model-based reinforcement learning methods, which often rely on reward prediction.

#### Length of Rollout

A further experiment was investigating how the length of individual rollouts affects performance. The authors performed a parameter searching. Figure 5(right) shows the influence of the rollout length. The strategy using 3 rollout steps improves the speed of learning and improves the performance significantly than 1 step, and 5 is the optimal number. This implies rollout can be very helpful and informative. This rollout enables the agent to learn moves it cannot recover from.

Figure 6: Experiments with a noisy environment model Left: each row shows an example 5-step rollout after conditioning on an environment observation. Errors accumulate and lead to various artifacts, including missing or duplicate sprites. Right: comparison of Monte-Carlo (MC) search and I2A when using either the accurate or the noisy model for rollouts.

#### Imperfections

To demonstrate I2A can handle less reliable predictions, the authors set experiment where the I2A used a poor environment model(smaller number of parameters), where the error may accumulate across the rollout(Figure 6 left). As shown in figure 6(right), even with relatively poor environment model, the performance of I2A is stable, unlike traditional Monte-Carlo search, which explicitly estimates the value of each action from rollouts, rather than learning an arbitrary encoding of the rollouts. An interesting result is that a rollout length 5 no longer outperforms a length of 3, which matches our common sense. The authors suggest that it is learning a rollout encoder that enables I2As to deal with imperfect model predictions.

#### Perfections

As I2A shows the robustness towards environment models, the authors tested an I2A agent with a nearly perfect environment model, and the results are in Table 1 and Table 2. Traditional Mento-Carlo Tree Search is tested as the baseline. From the table, although it is able to solve many levels, the search steps are very huge. On the contrary, I2A with the nearly perfect model can achieve the same fraction with much fewer steps.

#### Generalization

Lastly, the authors probe the generalization capabilities of I2As, beyond handling random level layouts in Sokoban. The agents were trained on levels with 4 boxes. Table 2 shows the performance of I2A when such an agent was tested on levels with different numbers of boxes, and that of the standard model-free agent for comparison. It turns out that I2As generalizes well; at 7 boxes, the I2A agent is still able to solve more than half of the levels, nearly as many as the standard agent on 4 boxes.

### MiniPacman

MiniPacman is a game modified from the classical game PacMan. In the game(Figure 8, left), the player explores a maze that contains food while being chased by ghosts. The maze also contains power pills; when eaten, for a fixed number of steps, the player moves faster, and the ghosts run away and can be eaten. These dynamics are common to all tasks. Each task is defined by a vector $w \in R^5$, associating a reward to each of the following five events: moving, eating food, eating a power pill, eating a ghost, and being eaten by a ghost. The goal of this part is the attempt that tries to apply the same I2A model to different tasks. The five tasks are described as follows:

• Regular: level is cleared when all the food is eaten;
• Avoid: level is cleared after 128 steps;
• Hunt: level is cleared when all ghosts are eaten or after 80 steps.
• Ambush: level is cleared when all ghosts are eaten or after 80 steps.
• Rush: level is cleared when all power pills are eaten.
Table 3: the reward settings in different tasks

Different from the task in Sokoban, in order to capture long-range dependencies across pixels, the authors also made use of a layer that is called pool-and-inject, which applies global max-pooling over each feature map and broadcasts the resulting values as feature maps of the same size and concatenates the result to the input. Pool-and-inject layers are therefore size-preserving layers which communicate the max-value of each layer globally to the next convolutional layer. The environment model for MiniPacman is shown in Figure 7.

Figure 7: The MiniPacman environment model

To illustrate the benefits of model-based methods in this multi-task setting, the authors trained a single environment model to predict both observations (frames) and events, where the environment model is effectively shared across all tasks. Results in Figure 7(right) illustrates the benefit of the I2A architecture, outperforming the standard agent in all tasks. Note that for tasks 4 & 5, the rewards are particularly sparse, and the anticipation of ghost dynamics is especially important. The I2A agent can leverage its environment and reward model to explore the environment much more effectively.

Figure 8: Minipacman environment Left: Two frames from a minipacman game: the player is green, dangerous ghosts red, food dark blue, empty corridors black, power pills in cyan. After eating a power pill (right frame), the player can eat the 4 weak ghosts (yellow). Right: Performance after 300 million environment steps for different agents and all tasks. Note I2A clearly outperforms the other two agents on all tasks with sparse rewards.

# Conclusion

In this paper, the authors applied recent success in CNN and reinforcement learning and raised a novel approach, which is a combination of model-free and model-based methods, called Imagination-augmented RL. This approach outperforms model-free baselines in Sokoban and MiniPacman games. As experiments suggest, this method is able to successfully use imperfect models to interpret future states and rewards. The imagination core part is essential in irreversible domains, where actions can have catastrophic outcomes. Compared to traditional Mente-Carlo search methods, the search space in I2A only grows linearly with the extension of the length of rollouts whereas I2As require far fewer function calls. This work may significantly broaden the applicability of model-based RL concepts and ideas.

# Insight

This is a paper with very interesting ideas. However, it seems that the work is really hard to reproduce for an individual researcher. Since the architecture works as a whole, it is very difficult to debug each single part. Meanwhile, the training process is kind of long with up to 1e9 steps, which is also a huge requirement for computing resources.

In terms of the architecture itself, the design the CNN for the tasks seems to be very empirical. The authors did not include the reasons or rules for this part. Yet why authors applied residual connection in this shadow network is unknown. According to the paper, even the CNN network is quite simple, some details in LSTM encoder are omitted. Therefore, the backpropagation process is not so clear across the whole model.

Back to the settings of environment model, the authors used pre-trained model instead of the jointly training way. Would it be hard to train both models simultaneously?

Lastly, the authors raised a new layer as Pool-and-inject layer, the motivation and plausibility are not so clear. It would be better if the authors can compare it with common pooling layer.

In spite of some missing details, this is a solid work with a novel idea and many tricks. In addition, the settings of the experiment are quite inspiring where we can learn from.

# Reference

1. Buesing, L., Badia, A.P., Battaglia, P.W., Guez, A., Heess, N., Li, Y., Pascanu, R., Racanière, S., Reichert, D.P., Rezende, D.J., Silver, D., Vinyals, O., Weber, T., & Wierstra, D. (2017). Imagination-Augmented Agents for Deep Reinforcement Learning. CoRR, abs/1707.06203.
2. YuXuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. arXiv preprint arXiv:1707.03374, 2017.
3. Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011.
4. Jessica B. Hamrick, Andy J. Ballard, Razvan Pascanu, Oriol Vinyals, Nicolas Heess, and Peter W. Battaglia. Metacontrol for adaptive imagination-based optimization. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.
5. Razvan Pascanu, Yujia Li, Oriol Vinyals, Nicolas Heess, David Reichert, Theophane Weber, Sebastien Racaniere, Lars Buesing, Daan Wierstra, and Peter Battaglia. Learning model-based planning from scratch. arXiv preprint, 2017.
6. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Petersen, S. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.