Learning Combinatorial Optimization Algorithms Over Graphs
Roles :
Abhi (Graph Theory),
Alvin (Reinforcement Learning/actual paper)
Pranav (actual paper),
Daniel (Conclusion: performance, adv, disadv, criticism)
Intro
1) Graph Theory (MVC, Maxcut, TSP) - Common problems to solve are: Minimum Vertex Cover: Given a graph G = (V, E, w), where V is the set of vertices, E is the set of edges, and w is the set of edge weights, find the minimum number of vertices to select so that every edge is covered, i.e. every edge has at least one selected endpoint (a short illustrative sketch follows the list of problems below).
Maximum Cut: Given a graph G, find a partition of the vertices into two disjoint subsets such that the total weight of the edges crossing the partition is maximized.
Travelling Salesman Problem: Given a graph G, find a tour of minimum total weight that visits every vertex exactly once and returns to the starting vertex.
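Below is a minimal illustrative sketch (not from the paper) of the weighted-graph representation <math>G=(V,E,w)</math> described above, together with the classic greedy 2-approximation for Minimum Vertex Cover: repeatedly pick an uncovered edge and add both of its endpoints to the cover. The edge list and weights are made up purely for illustration.

<syntaxhighlight lang="python">
def greedy_vertex_cover(edges):
    """Return a vertex cover: a set of vertices touching every edge."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))   # take both endpoints of this uncovered edge
    return cover

# Example graph: V is implicit in the edge list E; w maps edges to weights.
E = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
w = {e: 1.0 for e in E}            # unit weights for simplicity

print(greedy_vertex_cover(E))      # e.g. {0, 1, 2, 3}, depending on edge order
</syntaxhighlight>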
2) Reinforcement Learning - The core concept of Reinforcement Learning is to consider a partially observable Markov Decision Process. A Markov decision process is a 5-tuple <math>(S,A,P_\cdot(\cdot,\cdot),R_\cdot(\cdot,\cdot),\gamma)</math>, where (see the short sketch after this list):
* <math>S</math> is a finite set of states (it does not have to be finite in general, but for the purpose of this paper we assume it is),
* <math>A</math> is a finite set of actions (generally only feasible actions); alternatively, <math>A_s</math> is the finite set of actions available from state <math>s</math>,
* <math>P_a(s,s') = \Pr(s_{t+1}=s' \mid s_t = s, a_t=a)</math> is the probability that action <math>a</math> in state <math>s</math> at time <math>t</math> will lead to state <math>s'</math> at time <math>t+1</math>,
* <math>R_a(s,s')</math> is the immediate reward (or expected immediate reward) received after transitioning from state <math>s</math> to state <math>s'</math> due to action <math>a</math>; this reward is received between two consecutive time periods,
* <math>\gamma \in [0,1]</math> is the discount factor, which represents the difference in importance between future rewards and present rewards.
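As a concrete, purely illustrative example of the 5-tuple above, the sketch below encodes a tiny two-state MDP as plain Python data, with <math>P_a(s,s')</math> and <math>R_a(s,s')</math> stored as nested dictionaries; all numbers are hypothetical and not from the paper.

<syntaxhighlight lang="python">
S = [0, 1]                      # finite state set
A = ["stay", "switch"]          # finite action set
gamma = 0.9                     # discount factor in [0, 1]

# P_a(s, s'): probability of moving to s' when taking action a in state s.
P = {
    "stay":   {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
    "switch": {0: {0: 0.1, 1: 0.9}, 1: {0: 0.9, 1: 0.1}},
}

# R_a(s, s'): immediate reward received on the transition s -> s' under action a.
R = {
    "stay":   {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
    "switch": {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 0.0}},
}

def expected_reward(s, a):
    """Expected immediate reward of taking action a in state s."""
    return sum(P[a][s][sp] * R[a][s][sp] for sp in S)

print(expected_reward(0, "switch"))  # 0.1 * 0.0 + 0.9 * 1.0 = 0.9
</syntaxhighlight>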
In Reinforcement Learning, the policy is generally stochastic, which means that we associate a probability with choosing each action, as opposed to deterministically choosing a single action. Other talks have covered this in more detail; in short, maintaining a distribution of probabilities over actions, rather than arbitrary fixed choices, helps manage the exploration-exploitation tradeoff.
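The following sketch illustrates one common way (not necessarily the one used in the paper) to obtain such action probabilities: a softmax over hypothetical action-value estimates, from which an action is then sampled stochastically.

<syntaxhighlight lang="python">
import math
import random

def softmax_policy(q_values, temperature=1.0):
    """Turn action-value estimates into a probability distribution over actions."""
    exps = [math.exp(q / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(q_values, temperature=1.0):
    """Sample an action index according to the stochastic policy."""
    probs = softmax_policy(q_values, temperature)
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

q = [1.2, 0.4, 0.9]          # hypothetical value estimates for three actions
print(softmax_policy(q))     # probabilities, most mass on action 0
print(sample_action(q))      # usually 0, but actions 1 and 2 still get explored
</syntaxhighlight>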
Actual Paper:
Conclusions (Performance, advantages, disadvantages): A3C? S2V?
Criticism: