Learning Combinatorial Optimization Algorithms Over Graphs
Roles :
Abhi (Graph Theory),
Alvin (Reinforcement Learning/actual paper)
Pranav (actual paper),
Daniel (Conclusion: performance, adv, disadv, criticism)
Intro
1) Graph Theory (MVC, Maxcut, TSP) - Common problems to solve are: Minimum Vertex Cover: Given a graph G = (V, E, w), where V is the set of vertices, E is the set of edges, and w is the set of edge weights, find the minimum number of vertices to select so that every edge is covered, i.e. every edge has at least one selected endpoint (a short illustrative sketch follows the list of problems below).
Maximum Cut: Given a graph G, find a partition of the vertices into two disjoint subsets such that the total weight of the edges crossing the partition is maximized.
Travelling Salesman Problem: Given a graph G, find a tour of minimum total weight that visits every vertex exactly once and returns to the starting vertex.
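Below is a minimal illustrative sketch (not from the paper) of the weighted-graph representation <math>G=(V,E,w)</math> described above, together with the classic greedy 2-approximation for Minimum Vertex Cover: repeatedly pick an uncovered edge and add both of its endpoints to the cover. The edge list and weights are made up purely for illustration.

<syntaxhighlight lang="python">
def greedy_vertex_cover(edges):
    """Return a vertex cover: a set of vertices touching every edge."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))   # take both endpoints of this uncovered edge
    return cover

# Example graph: V is implicit in the edge list E; w maps edges to weights.
E = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
w = {e: 1.0 for e in E}            # unit weights for simplicity

print(greedy_vertex_cover(E))      # e.g. {0, 1, 2, 3}, depending on edge order
</syntaxhighlight>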
2) Reinforcement Learning - The core concept of Reinforcement Learning is to consider a partially observable Markov Decision Process. A Markov decision process is a 5-tuple <math>(S,A,P_\cdot(\cdot,\cdot),R_\cdot(\cdot,\cdot),\gamma)</math>, where (see the short sketch after this list):
* <math>S</math> is a finite set of states (it does not have to be finite in general, but for the purpose of this paper we assume it is),
* <math>A</math> is a finite set of actions (generally only feasible actions); alternatively, <math>A_s</math> is the finite set of actions available from state <math>s</math>,
* <math>P_a(s,s') = \Pr(s_{t+1}=s' \mid s_t = s, a_t=a)</math> is the probability that action <math>a</math> in state <math>s</math> at time <math>t</math> will lead to state <math>s'</math> at time <math>t+1</math>,
* <math>R_a(s,s')</math> is the immediate reward (or expected immediate reward) received after transitioning from state <math>s</math> to state <math>s'</math> due to action <math>a</math>; this reward is received between two consecutive time periods,
* <math>\gamma \in [0,1]</math> is the discount factor, which represents the difference in importance between future rewards and present rewards.
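As a concrete, purely illustrative example of the 5-tuple above, the sketch below encodes a tiny two-state MDP as plain Python data, with <math>P_a(s,s')</math> and <math>R_a(s,s')</math> stored as nested dictionaries; all numbers are hypothetical and not from the paper.

<syntaxhighlight lang="python">
S = [0, 1]                      # finite state set
A = ["stay", "switch"]          # finite action set
gamma = 0.9                     # discount factor in [0, 1]

# P_a(s, s'): probability of moving to s' when taking action a in state s.
P = {
    "stay":   {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
    "switch": {0: {0: 0.1, 1: 0.9}, 1: {0: 0.9, 1: 0.1}},
}

# R_a(s, s'): immediate reward received on the transition s -> s' under action a.
R = {
    "stay":   {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
    "switch": {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 0.0}},
}

def expected_reward(s, a):
    """Expected immediate reward of taking action a in state s."""
    return sum(P[a][s][sp] * R[a][s][sp] for sp in S)

print(expected_reward(0, "switch"))  # 0.1 * 0.0 + 0.9 * 1.0 = 0.9
</syntaxhighlight>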
In Reinforcement Learning, the policy is generally stochastic, which means that we associate a probability with choosing each action, as opposed to deterministically choosing a single action. Other talks have covered this in more detail; in short, maintaining a distribution of probabilities over actions, rather than arbitrary fixed choices, helps manage the exploration-exploitation tradeoff.
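The following sketch illustrates one common way (not necessarily the one used in the paper) to obtain such action probabilities: a softmax over hypothetical action-value estimates, from which an action is then sampled stochastically.

<syntaxhighlight lang="python">
import math
import random

def softmax_policy(q_values, temperature=1.0):
    """Turn action-value estimates into a probability distribution over actions."""
    exps = [math.exp(q / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(q_values, temperature=1.0):
    """Sample an action index according to the stochastic policy."""
    probs = softmax_policy(q_values, temperature)
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

q = [1.2, 0.4, 0.9]          # hypothetical value estimates for three actions
print(softmax_policy(q))     # probabilities, most mass on action 0
print(sample_action(q))      # usually 0, but actions 1 and 2 still get explored
</syntaxhighlight>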
Actual Paper:
Conclusions (Performance, advantages, disadvantages): A3C? S2V?
Criticism: