Learning Combinatorial Optimization Algorithms Over Graphs
Group Members
Abhi (Graph Theory),
Alvin (Reinforcement Learning/actual paper)
Pranav (actual paper),
Daniel (Conclusion: performance, adv, disadv, criticism)
Introduction and Problem Motivation
1) Graph Theory (MVC, Max-Cut, TSP) - We work with a weighted graph G = (V, E, w), where V is the set of vertices, E is the set of edges, and w is the set of edge weights. Common problems to solve are the following (a small code sketch evaluating these objectives appears after the list):

Minimum Vertex Cover: Given a graph G, find the smallest subset of vertices such that every edge has at least one endpoint in the subset.
Maximum Cut: Given a graph G, find a partition of the vertices into two subsets such that the total weight of the edges crossing the partition is maximized.
Travelling Salesman Problem: Given a weighted graph G, find a closed tour of minimum total weight that visits every vertex exactly once.
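To make the objectives concrete, here is a minimal sketch (not from the paper; the toy graph, candidate solutions, and helper names are illustrative assumptions) that evaluates each of the three objectives on a small weighted graph.

# Toy undirected weighted graph G = (V, E, w) as an edge list: (u, v, weight).
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.5), (3, 0, 1.0), (0, 2, 3.0)]
vertices = {0, 1, 2, 3}

def is_vertex_cover(cover, edges):
    # MVC feasibility: every edge must have at least one endpoint in the cover.
    return all(u in cover or v in cover for u, v, _ in edges)

def cut_weight(partition, edges):
    # Max-Cut objective: total weight of edges with endpoints on opposite sides.
    return sum(w for u, v, w in edges if (u in partition) != (v in partition))

def tour_length(tour, weight):
    # TSP objective: length of a closed tour visiting every vertex once.
    return sum(weight[tour[i], tour[(i + 1) % len(tour)]] for i in range(len(tour)))

weight = {}
for u, v, w in edges:
    weight[u, v] = weight[v, u] = w

print(is_vertex_cover({0, 2}, edges))      # True: {0, 2} touches every edge
print(cut_weight({0, 2}, edges))           # weight of the cut ({0, 2}, {1, 3}) = 5.5
print(tour_length([0, 1, 2, 3], weight))   # 1.0 + 2.0 + 1.5 + 1.0 = 5.5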
2) Reinforcement Learning - The core concept of reinforcement learning is to model the problem as a Markov decision process (more generally, a partially observable one). A Markov decision process is a 5-tuple [math]\displaystyle{ (S,A,P_\cdot(\cdot,\cdot),R_\cdot(\cdot,\cdot),\gamma) }[/math] (a small code sketch of such a tuple follows the list below), where
- [math]\displaystyle{ S }[/math] is a finite set of states (states need not be finite in general, but for the purposes of this paper we assume they are),
- [math]\displaystyle{ A }[/math] is a finite set of actions (generally only feasible actions) (alternatively, [math]\displaystyle{ A_s }[/math] is the finite set of actions available from state [math]\displaystyle{ s }[/math]),
- [math]\displaystyle{ P_a(s,s') = \Pr(s_{t+1}=s' \mid s_t = s, a_t=a) }[/math] is the probability that action [math]\displaystyle{ a }[/math] in state [math]\displaystyle{ s }[/math] at time [math]\displaystyle{ t }[/math] will lead to state [math]\displaystyle{ s' }[/math] at time [math]\displaystyle{ t+1 }[/math],
- [math]\displaystyle{ R_a(s,s') }[/math] is the immediate reward (or expected immediate reward) received after transitioning from state [math]\displaystyle{ s }[/math] to state [math]\displaystyle{ s' }[/math] due to action [math]\displaystyle{ a }[/math], i.e., the reward accrued between two consecutive time steps,
- [math]\displaystyle{ \gamma \in [0,1] }[/math] is the discount factor, which represents the difference in importance between future rewards and present rewards.
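As a concrete illustration of the tuple above, the following is a minimal sketch (illustrative, not from the paper; all numbers are made up): a two-state, two-action MDP whose transition and reward tensors play the roles of [math]\displaystyle{ P_a(s,s') }[/math] and [math]\displaystyle{ R_a(s,s') }[/math], followed by value iteration to estimate the optimal state values.

import numpy as np

S = [0, 1]                      # finite set of states
A = [0, 1]                      # finite set of actions
gamma = 0.9                     # discount factor in [0, 1]

# P[a][s][s'] = Pr(s_{t+1} = s' | s_t = s, a_t = a)
P = np.array([[[0.8, 0.2],      # action 0
               [0.1, 0.9]],
              [[0.5, 0.5],      # action 1
               [0.6, 0.4]]])

# R[a][s][s'] = immediate reward for the transition (s, a, s')
R = np.array([[[1.0, 0.0],
               [0.0, 2.0]],
              [[0.5, 0.5],
               [1.0, 0.0]]])

# Value iteration: V(s) <- max_a sum_{s'} P_a(s, s') [R_a(s, s') + gamma V(s')]
V = np.zeros(len(S))
for _ in range(100):
    Q = (P * (R + gamma * V)).sum(axis=2)   # Q[a][s]
    V = Q.max(axis=0)
print(V)                                     # approximate optimal state values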
In reinforcement learning, the policy is generally stochastic, meaning that we associate a probability with each action rather than making a deterministic choice. Other presentations have covered this in more detail; the key idea is that keeping a probability distribution over actions, rather than always committing to a single one, helps maintain the exploration-exploitation tradeoff.
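As a simple example of such a stochastic rule (an illustrative sketch, not necessarily the exact policy used in the paper), an epsilon-greedy policy explores with probability epsilon and otherwise exploits the action with the highest estimated value:

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # q_values: list of estimated values Q(s, a), one entry per feasible action.
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

# Example: three feasible actions with estimated action values.
print(epsilon_greedy([0.2, 1.5, 0.7], epsilon=0.1))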
Model
Criticism
(Performance, advantages, disadvantages): A3C (Asynchronous Advantage Actor-Critic)? S2V (structure2vec)?