Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling
To Be Filled In
Introduction and Motivation
In recent years, reinforcement learning methods have been applied to many games, such as chess and checkers. More recently, convolutional neural networks (CNNs) have allowed agents to outperform humans in difficult games such as Go. However, most of these successes involve a discrete state or action space, and the methods cannot be applied directly to continuous action spaces.
This paper introduces a method for learning in continuous action spaces. A CNN learns over discretized state and action spaces, and a continuous action search is then performed using these discrete results.
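The two-stage idea can be illustrated with a minimal sketch: score a coarse grid of actions, then refine the best one in the continuous space around it. Everything here is an assumption made for illustration. The one-dimensional action, the `toy_value` function, and the `refine_action` helper are hypothetical, and ternary search stands in for the paper's continuous search, which these notes do not describe in detail. Ternary search also assumes the value has a single peak inside the search window.

```python
def refine_action(value_fn, grid, radius, steps=20):
    """Pick the best action on a coarse grid, then refine it nearby.

    Hypothetical helper: value_fn maps a scalar action to a score.
    Refinement uses ternary search, which assumes value_fn has a
    single peak inside [best - radius, best + radius].
    """
    best = max(grid, key=value_fn)          # discrete stage
    lo, hi = best - radius, best + radius   # continuous window around it
    for _ in range(steps):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if value_fn(m1) < value_fn(m2):
            lo = m1                         # peak lies to the right of m1
        else:
            hi = m2                         # peak lies to the left of m2
    return (lo + hi) / 2


def toy_value(action):
    """Toy stand-in for a learned value estimate, peaked at 0.37."""
    return -(action - 0.37) ** 2


grid = [i / 4 for i in range(5)]            # coarse actions 0, 0.25, ..., 1
action = refine_action(toy_value, grid, radius=0.25)
```

The coarse stage alone can only return a grid point, while the refinement stage recovers an action between grid points.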
AlphaGo Lee (Silver et al., 2016, TODO) is the algorithm for playing Go that defeated international champion Lee Sedol. It used two neural networks, a policy network and a value network, with training starting from the moves of human experts. A Monte Carlo Tree Search algorithm was used for policy improvement.
The use of both a policy network and a value network is reflected in this paper's work.
AlphaGo Zero (Silver et al., 2017, TODO) is an improvement on the AlphaGo Lee algorithm. AlphaGo Zero uses a single unified neural network in place of the separate policy and value networks, and is trained by self-play, without the need for expert training data.
Both the unified network and self-play training are also reflected in this paper.
Some earlier algorithms have been proposed to deal with continuous action spaces. For example, Yamamoto et al. (2015, TODO) use game tree search methods in a discretized space. The value of an action is taken as the average of nearby values, with the averaging informed by knowledge of execution uncertainty.
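The averaging step can be sketched as follows. This is an illustrative reading, not the cited authors' implementation: the one-dimensional grid, the Gaussian noise model, and the names `smoothed_values`, `spacing`, and `sigma` are all assumptions introduced here.

```python
import math


def smoothed_values(values, spacing, sigma):
    """Average each grid cell with its neighbours, weighted by a Gaussian.

    Assumption: execution error is Gaussian with standard deviation sigma,
    so an action aimed at cell i lands near cell j with weight
    exp(-d^2 / (2 * sigma^2)), where d is the distance between the cells.
    """
    n = len(values)
    out = []
    for i in range(n):
        weights = [
            math.exp(-0.5 * ((i - j) * spacing / sigma) ** 2)
            for j in range(n)
        ]
        total = sum(weights)                # normalise so weights sum to 1
        out.append(sum(w * v for w, v in zip(weights, values)) / total)
    return out


raw = [0.0, 0.0, 1.0, 0.0, 0.0]             # one high-value cell
smooth = smoothed_values(raw, spacing=1.0, sigma=1.0)
```

With this weighting, an action whose neighbours are all poor loses value, which penalizes shots that only work when executed perfectly.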
Monte Carlo Tree Search
Monte Carlo Tree Search algorithms have also been applied to continuous action spaces. These algorithms, to be discussed in further detail (TODO), balance exploring new states against exploiting knowledge of the paths taken through past games.
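The exploration and exploitation trade-off is commonly expressed with the UCB1 selection rule, sketched below for a node with a finite set of children. This is the standard discrete rule, shown only for illustration. The variant used for continuous actions is not described in these notes, and the function name `ucb1_select` and its parameters are hypothetical.

```python
import math


def ucb1_select(counts, value_sums, c=math.sqrt(2)):
    """Return the index of the child with the highest UCB1 score.

    counts[i] is the number of visits to child i and value_sums[i] is the
    total reward observed through it. The score is the mean reward
    (exploitation) plus a bonus that shrinks as the child is visited more
    often (exploration).
    """
    total = sum(counts)
    best_index, best_score = None, -math.inf
    for i, (n, s) in enumerate(zip(counts, value_sums)):
        if n == 0:
            return i                        # visit every child once first
        score = s / n + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Child 0 has the better mean, but child 1 has barely been explored.
explore_choice = ucb1_select([10, 1], [6.0, 0.4])
greedy_choice = ucb1_select([10, 1], [6.0, 0.4], c=0.0)
```

Setting `c` to zero removes the bonus and reduces the rule to picking the child with the best mean reward so far.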
Curling Physics and Simulation
Several works cited in the paper concern the study and simulation of curling physics.
The authors design a CNN called the 'policy-value' network. Given an input state, this network outputs a probability distribution over actions and the expected reward. It is trained both to find an optimal policy and to predict rewards.
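A network with two outputs is typically trained with a loss that sums one term per output. The sketch below follows the form used by AlphaGo Zero, with cross-entropy on the policy and squared error on the value, and omits regularization. It is an assumption that a loss of this shape applies here, since these notes do not give the paper's actual loss, and all names in the code are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution over actions."""
    m = max(logits)                         # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def policy_value_loss(policy_logits, value_pred, target_policy, target_value):
    """Cross-entropy on the policy output plus squared error on the value."""
    probs = softmax(policy_logits)
    policy_loss = -sum(
        t * math.log(p) for t, p in zip(target_policy, probs) if t > 0
    )
    value_loss = (value_pred - target_value) ** 2
    return policy_loss + value_loss


# Two discretized actions, an uninformed policy, and a value that is off by 0.5.
loss = policy_value_loss([0.0, 0.0], 0.5, [1.0, 0.0], 1.0)
```

Summing the two terms lets a single set of shared weights be updated for both objectives in one training step.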
Experimental Procedure and Results
- Lee, K., Kim, S., Choi, J. & Lee, S. "Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling." Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:2937-2946 (2018)