Learning Combinatorial Optimization


Learning Combinatorial Optimization Algorithms Over Graphs


Group Members

Abhi (Graph Theory),

Alvin (Reinforcement Learning/actual paper)

Pranav (actual paper),

Daniel (Conclusion: performance, adv, disadv, criticism)

Introduction and Problem Motivation

1) Graph Theory (MVC, Maxcut, TSP) - The common problems addressed are the following (see the short code sketch after these definitions):

Minimum Vertex Cover: Given a graph G = (V, E, w), where V is the set of vertices, E is the set of edges, and w is the set of edge weights, find the smallest subset of vertices such that every edge has at least one endpoint in the subset.

Maximum Cut: Given a graph G = (V, E, w), find a subset of vertices S such that the total weight of the edges crossing the cut (S, V \ S) is maximized.

Travelling Salesman Problem: Given a graph G = (V, E, w), find a tour of minimum total weight that visits every vertex exactly once and returns to the starting vertex.
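To make the objectives concrete, here is a minimal Python sketch (illustrative only, not code from the paper) that evaluates a candidate solution for each of the three problems; the edge-list representation and function names are assumptions.

```python
# Graph represented as a list of weighted edges (u, v, w) -- an illustrative
# convention, not the paper's representation.

def vertex_cover_size(edges, cover):
    """Size of a vertex cover, or None if some edge is uncovered."""
    for u, v, _ in edges:
        if u not in cover and v not in cover:
            return None  # not a valid cover
    return len(cover)

def cut_weight(edges, side_s):
    """Total weight of edges crossing the cut (S, V \\ S)."""
    return sum(w for u, v, w in edges if (u in side_s) != (v in side_s))

def tour_length(dist, tour):
    """Total length of a closed TSP tour, given a distance dict dist[(u, v)]."""
    return sum(dist[(tour[i], tour[(i + 1) % len(tour)])] for i in range(len(tour)))

edges = [("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 3.0)]
print(vertex_cover_size(edges, {"a", "b"}))   # 2 (covers all edges)
print(cut_weight(edges, {"a"}))               # 1.0 + 3.0 = 4.0
dist = {(u, v): w for u, v, w in edges}
dist.update({(v, u): w for u, v, w in edges})
print(tour_length(dist, ["a", "b", "c"]))     # 1.0 + 2.0 + 3.0 = 6.0
```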

2) Reinforcement Learning - The core concept of Reinforcement Learning is to consider a partially observable Markov Decision Process. A Markov decision process is a 5-tuple [math]\displaystyle{ (S,A,P_\cdot(\cdot,\cdot),R_\cdot(\cdot,\cdot),\gamma) }[/math], where

  • [math]\displaystyle{ S }[/math] is a finite set of states (it does not have to be finite in general, but for the purposes of this paper we assume it is),
  • [math]\displaystyle{ A }[/math] is a finite set of actions (generally only feasible actions) (alternatively, [math]\displaystyle{ A_s }[/math] is the finite set of actions available from state [math]\displaystyle{ s }[/math]),
  • [math]\displaystyle{ P_a(s,s') = \Pr(s_{t+1}=s' \mid s_t = s, a_t=a) }[/math] is the probability that action [math]\displaystyle{ a }[/math] in state [math]\displaystyle{ s }[/math] at time [math]\displaystyle{ t }[/math] will lead to state [math]\displaystyle{ s' }[/math] at time [math]\displaystyle{ t+1 }[/math],
  • [math]\displaystyle{ R_a(s,s') }[/math] is the immediate reward (or expected immediate reward) received after transitioning from state [math]\displaystyle{ s }[/math] to state [math]\displaystyle{ s' }[/math] due to action [math]\displaystyle{ a }[/math]; this reward is received between two consecutive time periods,
  • [math]\displaystyle{ \gamma \in [0,1] }[/math] is the discount factor, which represents the difference in importance between future rewards and present rewards.
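Since the paper learns its greedy construction policy with a Q-learning-style update, a minimal tabular Q-learning sketch over an MDP of this form may help fix ideas. The environment interface (reset/step/done), hyperparameter values, and function names below are assumptions for illustration, not the paper's fitted Q-learning with structure2vec.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    """One-step tabular Q-learning over an assumed env with reset/step/done."""
    Q = defaultdict(float)  # Q[(s, a)], initialized to 0
    for _ in range(episodes):
        s = env.reset()
        while not env.done(s):
            # epsilon-greedy action selection (see the policy discussion below)
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r = env.step(s, a)  # transition sampled from P, reward from R
            # one-step target: r + gamma * max_a' Q(s', a')
            best_next = max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```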

In Reinforcement Learning, the policy is generally stochastic: the agent associates a probability with each feasible action rather than choosing an action deterministically. Other presentations have discussed this in more detail; the key idea is that keeping a distribution over actions, rather than always taking the single best-looking one, helps manage the exploration-exploitation tradeoff.
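As one concrete way of turning action values into such a distribution, here is a small softmax (Boltzmann) policy sketch. This is an illustrative assumption, not necessarily the exploration scheme used in the paper; the action names are made up.

```python
import math
import random

def softmax_policy(q_values, temperature=1.0):
    """Return action probabilities proportional to exp(Q / temperature)."""
    m = max(q_values.values())  # subtract the max for numerical stability
    exps = {a: math.exp((q - m) / temperature) for a, q in q_values.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def sample_action(q_values, temperature=1.0):
    """Sample an action from the softmax distribution instead of acting greedily."""
    probs = softmax_policy(q_values, temperature)
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights, k=1)[0]

# Higher-valued actions are chosen more often, but not always.
print(softmax_policy({"add_node_1": 2.0, "add_node_2": 1.0, "add_node_3": 0.5}))
```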


Model

Criticism

(Performance, advantages, disadvantages): A3C? S2V?

Conclusions