Deep Reinforcement Learning in Continuous Action Spaces a Case Study in the Game of Simulated Curling

To Be Filled In

Introduction and Motivation

In recent years, Reinforcement Learning methods have been applied to many different games, such as chess and checkers. In even more recent years, the use of CNN's has allowed neural networks to out-perform humans in many difficult games, such as Go. However, many of these cases involve a discrete state or action space, and these methods cannot be directly applied to continuous action spaces.

This paper introduces a method to allow learning with continuous action spaces. A CNN is used to perform learning on a discretion state and action spaces, and then a continuous action search is performed on these discrete results.

Related Work

AlphaGo Lee

AlphaGo Lee (Silver et al., 2016, TODO) refers to an algorithm used to play the game Go, which was able to defeat internation champion Lee Sedol. Two neural networks were trained on the moves of human experts, to act as both a policy network and a value network. A Monte Carlo Tree Search algorithm was used for policy improvement.

The use of both policy and value networks are reflected in this paper's work.

AlphaGo Zero

AlphaGo Zero (Silver et al., 2017, TODO) is an improvement on the AlphaGo Lee algorithm. AlphaGo Zero uses a unified neural network in place of the separate policy and value networks, and is trained on self-play, without the need of expert training.

The unification of networks, and self-play are also reflected in this paper.

Curling Algorithms

Some past algorithms have been proposed to deal with continuous action spaces. For example, (Yammamoto et al, 2015, TODO) use game tree search methods in a discretized space. The value of an action is taken as the average of nearby values, with respect to some knowledge of execution uncertainty.

Monte Carlo Tree Search

Monte Carlo Tree Search algorithms have been applied to continuous action spaces. These algorithms, to be discussed in further detail (TODO), balance exploration of different states, with knowledge of paths of execution through past games.

Curling Physics and Simulation

Several references in the paper refer to the study and simulation of curling physics.

Network Design

The authors design a CNN, called the 'policy-value' network. This network gives a probability distribution of actions, and expected rewards, given an input state. This is trained both to find an optimal policy, and to predict rewards.

Validation

Experimental Procedure and Results

Result Figures

Commentary

Future Work

References

Lee, K., Kim, S., Choi, J. & Lee, S. "Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling." Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:2937-2946 (2018)