Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling

To Be Filled In


= Introduction and Motivation =


In recent years, reinforcement learning methods have been applied to many different games, such as chess and checkers. More recently, the use of CNNs has allowed neural networks to outperform humans in many difficult games, such as Go. However, many of these cases involve a discrete state or action space, and these methods cannot be directly applied to continuous action spaces.
This paper introduces a method that allows learning with continuous action spaces. A CNN is used to perform learning on discretized state and action spaces, and a continuous action search is then performed on these discrete results.
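As a rough illustration of this two-stage idea, the sketch below evaluates a learned value estimate on a coarse grid of actions and then refines the best candidate with a local continuous search. The function and parameter names are hypothetical; this is not the authors' implementation.

<pre>
import numpy as np

# Sketch of the discretize-then-refine idea (hypothetical helpers, not the paper's code).
def select_action(state, value_fn, coarse_actions, refine_radius=0.05, n_samples=100):
    """Pick the best action on a coarse grid, then refine it by local continuous search."""
    coarse_actions = np.asarray(coarse_actions, dtype=float)   # shape (N, action_dim)
    coarse_values = np.array([value_fn(state, a) for a in coarse_actions])
    best = coarse_actions[int(np.argmax(coarse_values))]

    # Local continuous search: random perturbations around the best discrete action.
    candidates = best + np.random.uniform(-refine_radius, refine_radius,
                                          size=(n_samples, best.shape[0]))
    refined_values = np.array([value_fn(state, a) for a in candidates])
    if refined_values.max() > coarse_values.max():
        return candidates[int(np.argmax(refined_values))]
    return best
</pre>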


== Curling ==


Curling is a sport played by two teams on a long sheet of ice. Roughly, the goal is for each team to slide rocks closer to the target at the other end of the sheet than the opposing team. The next sections provide background on the gameplay and on potential challenges for learning algorithms, followed by a short terminology section.
 
=== Gameplay ===
 
A game of curling is divided into ends. In each end, players from both teams alternate throwing (sliding) eight rocks each toward the far end of the ice sheet, where the house (a set of concentric rings, 12 ft in diameter and smaller) is located. Rocks must come to rest in a certain area in order to stay in play, and must touch or sit inside the rings in order to be eligible to score. Once the end is complete, the team with rocks closest to the center of the house scores points.
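As a concrete illustration of the scoring rule, the simplified sketch below (not code from the paper; assumptions are noted in the comments) awards points only to the team with the rock closest to the center of the house, one point per rock that sits in the house closer than every opposing rock.

<pre>
# Simplified end-scoring sketch: distances are measured from the center of the house
# (the button), and rock radius is ignored for clarity.
HOUSE_RADIUS = 1.83  # ~6 ft in meters (the 12 ft ring has a 6 ft radius)

def score_end(team_a_dists, team_b_dists, house_radius=HOUSE_RADIUS):
    """Return (points_a, points_b) for one end, given each team's rock distances."""
    a_in = sorted(d for d in team_a_dists if d <= house_radius)
    b_in = sorted(d for d in team_b_dists if d <= house_radius)
    if not a_in and not b_in:
        return 0, 0                                  # no rocks in the house: no score
    if not b_in or (a_in and a_in[0] < b_in[0]):     # team A holds shot rock
        best_b = b_in[0] if b_in else float("inf")
        return sum(1 for d in a_in if d < best_b), 0
    best_a = a_in[0] if a_in else float("inf")       # otherwise team B scores
    return 0, sum(1 for d in b_in if d < best_a)

# Example: A's rocks at 0.4 m and 1.0 m, B's closest in-house rock at 0.8 m -> A scores 1.
print(score_end([0.4, 1.0], [0.8, 2.5]))  # (1, 0)
</pre>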
 
When throwing a rock, the curler can spin it. This causes the rock to 'curl' along its path towards the house, and can allow it to travel around other rocks. Team members are also able to sweep the ice in front of a moving rock to decrease friction, which allows for fine-tuning of distance (though the physics of sweeping are not implemented in the simulation used).
 
Curling offers many possible high-level actions, which are typically called by one team member and executed by the thrower. An example set includes:
 
* Draw: Throw a rock to a target location
* Freeze: Draw a rock up against another rock
* Takeout: Knock another rock out of the house. Can be combined with different ricochet directions
* Guard: Place a rock in front of another to block other rocks (e.g., takeout attempts)
 
=== Challenges for AI ===
 
Curling poses many challenges for learning algorithms, based on its physics and rules. This section lists a few concerns.
 
The effect of changing an action can be highly nonlinear and discontinuous: a 1 cm deviation in a rock's path can make the difference between a high-speed collision and no collision at all.
 
Curling requires both offensive and defensive strategies. For example, the team that throws the last rock of an end only needs to place that rock closer to the center than all of the opposing team's rocks in order to score a point and invalidate any opposing rocks in the house. The opposing team should therefore consider how to prevent this from happening, in addition to scoring points themselves.
 
Curling also has a concept known as 'the hammer'. The hammer belongs to the team that throws the last rock of each end, providing an advantage, and it is given to the team that does not score in an end. It can therefore be good strategy to avoid scoring only a single point in an end (for example, when already ahead in points), since doing so would hand this advantage to the opposing team.
 
Finally, curling has a rule known as the 'Free Guard Zone'. It applies to the first four rocks thrown in an end (two from each team): if they come to rest short of the house but still in play, they are not allowed to be removed (via collisions) until all of the first four rocks have been thrown.
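A rough sketch of how a simulator might enforce this rule is given below; the representation (shot numbers, rock sets) is an assumption for illustration and is not taken from the paper or its simulator.

<pre>
def violates_free_guard_zone(shot_number, removed_opponent_rocks, protected_guards):
    """Return True if a shot illegally removes a protected guard.

    shot_number            -- 1-based index of the rock thrown in the current end
    removed_opponent_rocks -- opponent rocks this shot knocked out of play
    protected_guards       -- rocks resting in the free guard zone (in play, short
                              of the house) before the shot
    """
    if shot_number > 4:      # only the first four rocks of an end are protected
        return False
    return any(rock in protected_guards for rock in removed_opponent_rocks)

# A simulator would typically restore the pre-shot positions and remove the delivered
# rock when this returns True (assumed handling, not verified against the paper).
</pre>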
 
=== Terminology ===
 
* End: A round of the game
* House: The area at the end of the sheet of ice, which contains the concentric scoring rings
* Hammer: The team that throws the last rock of an end 'has the hammer'
* Hog Line: A thick line drawn in front of the house, orthogonal to the length of the ice sheet. Rocks must pass this line to remain in play.
* Back Line: A thick line drawn just behind the house. Rocks that pass this line are removed from play.
 
 
== Related Work ==
 
=== AlphaGo Lee ===


AlphaGo Lee (Silver et al., 2016, TODO) refers to the algorithm used to play the game of Go that was able to defeat international champion Lee Sedol. Two neural networks were trained on the moves of human experts, to act as a policy network and a value network respectively. A Monte Carlo tree search algorithm was used for policy improvement.
The use of both policy and value networks is reflected in this paper's work.


=== AlphaGo Zero ===


AlphaGo Zero (Silver et al., 2017, TODO) is an improvement on the AlphaGo Lee algorithm. AlphaGo Zero uses a single unified neural network in place of the separate policy and value networks, and is trained through self-play, without the need for expert training data.
The unification of the two networks and the use of self-play are also reflected in this paper.


=== Curling Algorithms ===


Some past algorithms have been proposed to deal with continuous action spaces. For example, Yamamoto et al. (2015, TODO) use game tree search methods in a discretized space. The value of an action is taken as the average of the values of nearby actions, weighted by some knowledge of execution uncertainty.
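A sketch of that idea is shown below: the value assigned to a discretized action is a weighted average of the values at neighbouring grid actions, with weights taken from an assumed Gaussian model of execution error. This is illustrative only, not Yamamoto et al.'s code.

<pre>
import numpy as np

def smoothed_value(action_idx, action_grid, raw_values, exec_std=0.05):
    """Expected value of a grid action under Gaussian execution uncertainty.

    action_grid -- (N, d) array of discretized actions
    raw_values  -- (N,) array of the value estimated at each grid point
    exec_std    -- assumed standard deviation of execution error (same units as actions)
    """
    target = action_grid[action_idx]
    sq_dist = np.sum((action_grid - target) ** 2, axis=1)
    weights = np.exp(-sq_dist / (2.0 * exec_std ** 2))
    return float(np.sum(weights * raw_values) / np.sum(weights))
</pre>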


=== Monte Carlo Tree Search ===


Monte Carlo tree search algorithms have been applied to continuous action spaces. These algorithms, to be discussed in further detail (TODO), balance exploration of new states with knowledge gained from paths of execution through past games.
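The exploration/exploitation balance is typically captured by a UCB-style selection rule; a generic sketch (not the specific variant used in this paper) follows.

<pre>
import math

def ucb_select(children, exploration_c=1.4):
    """Pick the child node maximizing mean value plus an exploration bonus (UCB1)."""
    total_visits = sum(child["visits"] for child in children)

    def ucb(child):
        if child["visits"] == 0:
            return float("inf")          # always try unvisited children first
        mean = child["value_sum"] / child["visits"]
        bonus = exploration_c * math.sqrt(math.log(total_visits) / child["visits"])
        return mean + bonus

    return max(children, key=ucb)
</pre>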


=== Curling Physics and Simulation ===


Several references in the paper refer to the study and simulation of curling physics.




 
= Methods =
 
== Network Design ==


The authors design a CNN, called the 'policy-value' network. Given an input state, this network outputs a probability distribution over actions and the expected reward. It is trained both to find an optimal policy and to predict rewards.
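A minimal sketch of such a two-headed network is given below; the framework, layer sizes, and input encoding are assumptions for illustration, not the paper's architecture.

<pre>
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Shared convolutional trunk with a policy head and a value head."""

    def __init__(self, in_channels, board_size, num_actions):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        flat = 32 * board_size * board_size
        self.policy_head = nn.Linear(flat, num_actions)  # logits over discretized actions
        self.value_head = nn.Linear(flat, 1)             # expected reward for the state

    def forward(self, x):
        h = self.trunk(x).flatten(start_dim=1)
        policy_logits = self.policy_head(h)              # softmax applied by the loss
        value = torch.tanh(self.value_head(h))           # assumed reward scale in [-1, 1]
        return policy_logits, value
</pre>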


= Validation =

== Experimental Procedure and Results ==

== Result Figures ==

== Commentary ==

= Future Work =

= References =

# Lee, K., Kim, S., Choi, J. & Lee, S. "Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling." Proceedings of the 35th International Conference on Machine Learning, PMLR 80:2937-2946 (2018).