Deep Exploration via Bootstrapped DQN: Difference between revisions

From statwiki
Jump to navigation Jump to search
(edited summary a bit)
Line 1: Line 1:


== Gist ==
=== Status: In Progress ===


Efficient exploration remains a major challenge for reinforcement learning. Common dithering strategies for exploration, like $\epsilon$-greedy and Boltzmann Exploration, do not carry out deep exploration. So you need exponentially more data to train your agent. Also, most algorithms for statistically efficient reinforcement learning are not computationally tractable in complex environments. A list of available exploration strategies are available on [https://web.stanford.edu/class/msande338/lec9.pdf this page].
== Intro to Reinforcement Learning ==


Randomized value functions offer a promising approach to efficient exploration with generalization, but existing algorithms are not compatible with nonlinearly parameterized value functions. The authors propose '''bootstrapped DQN''' as a first step towards addressing such contexts. They go on to demonstrate that bootstrapped DQN can combine deep exploration with deep neural networks for exponentially faster learning than ''any dithering strategy''.
In reinforcement learning, an agent interacts with an environment with the goal to maximize its long term reward. This is the same as [https://en.wikipedia.org/wiki/Multi-armed_bandit multi armed bandit problem]. In a multi armed bandit problem, there is a gambler and there are $n$ slot machines, and the gambler can choose to play any specific slot machine at any time. All the slot machines have their own probability distributions by which they churn out rewards, but this is unknown to the gambler. So the question is, how can the gambler learn how to get the maximum long term reward?
 
There are two things the gambler can do at any instance: either he can try a new slot machine, or he can play the slot machine he has tried before (and he knows he will get some reward). However, even though trying a new slot machine feels like it would bring less reward to the gambler, it is possible that the gambler finds out a new slot machine that gives a better reward than the current best slot machine. This is the dilemma of '''exploration vs exploitation'''. Trying out a new slot machine is '''exploration''', while redoing the best move so far is '''exploiting''' the currently understood perception of the reward.
 
[[File:multiarmedbandit.jpg|thumb|Source: [https://blogs.mathworks.com/images/loren/2016/multiarmedbandit.jpg blogs.mathworks.com]]]
 
There are many strategies to approach this '''exploration-exploitation dilemma'''. A simple list of strategies can be found online [https://web.stanford.edu/class/msande338/lec9.pdf here]. A lot of these approaches are provably efficient, but assume that the state space is not very large. Strategies for large state spaces often just either ignore exploration, or do something naive like $\epsilon$-greedy, where you exploit with $1-\epsilon$ probability and explore "randomly" in rest of the cases.
 
This paper tries to use a Thompson sampling like approach to make decisions.
 
== Thompson Sampling<sup>[[#References|[1]]]</sup> ==
 
In Thompson sampling, our goal is to reach a belief that resembles the truth. Let's consider a case of coin tosses (2-armed bandit). Suppose we want to be able to reach a satisfactory pdf for $\mathbb{P}_h$ (heads). Assuming that this is a Bernoulli bandit problem, i.e. the rewards are $0$ or $1$, we can start off with $\mathbb{P}_h^{(0)}=\beta(1,1)$. The $\beta(x,y)$ distribution is a very good choice for a possible pdf because it works well for Bernoulli rewards. Further $\beta(1,1)$ is the uniform distribution $\mathbb{N}(0,1)$.
 
Now, at every iteration $t$, we observe the reward $R^{(t)}$ and try to make our belief close to the truth by doing a Bayesian computation. Assuming $p$ is the probability of getting a heads,
 
$$
\begin{align*}
\mathbb{P}(R|D) &\propto \mathbb{P}(D|R) \cdot \mathbb{P}(R) \\
\mathbb{P}_h^{(t+1)}&\propto \mbox{likelihood}\cdot\mbox{prior} \\
&\propto p^{R^{(t)}}(1-p)^{R^{(t)}} \cdot \mathbb{P}_h^{(t)} \\
&\propto p^{R^{(t)}}(1-p)^{R^{(t)}} \cdot \beta(x_t, y_t) \\
&\propto p^{R^{(t)}}(1-p)^{R^{(t)}} \cdot p^{x_t-1}(1-p)^{y_t-1} \\
&\propto p^{x_t+R^{(t)}-1}(1-p)^{y_t+R^{(t)}-1} \\
&\propto \beta(x_t+R^{(t)}, y_t+R^{(t)})
\end{align*}
$$
 
[[File:thompson sampling coin example.png|thumb||||600px|Source: [https://www.quora.com/What-is-Thompson-sampling-in-laymans-terms Quora]]]
 
This means that with successive sampling, our belief can become better at approximating the truth. There are similar update rules if we use a non Bernoulli setting, say, Gaussian. In the Gaussian case, we start with $\mathbb{P}_h^{(0)}=\mathbb{N}(0,1)$ and given that $\mathbb{P}_h^{(t)}\propto\mathbb{N}(\mu, \sigma)$ it is possible to show that the update rule looks like
 
$$
\mathbb{P}_h^{(t+1)} \propto \mathbb{N}\bigg(\frac{t\mu+R^{(t)}}{t+1},\frac{\sigma}{\sigma+1}\bigg)
$$
 
=== How can we use this in reinforcement learning? ===
 
We can use this idea to decide when to explore and when to exploit. We start with an initial belief, choose an action, observe the reward and based on the kind of reward, we update our belief about what action to choose next.
 
== References ==
 
1. [https://bandits.wikischolars.columbia.edu/file/view/Lecture+4.pdf Learning and optimization for sequential decision making, Columbia University, Lec 4]

Revision as of 19:26, 15 November 2017

Status: In Progress

Intro to Reinforcement Learning

In reinforcement learning, an agent interacts with an environment with the goal to maximize its long term reward. This is the same as multi armed bandit problem. In a multi armed bandit problem, there is a gambler and there are $n$ slot machines, and the gambler can choose to play any specific slot machine at any time. All the slot machines have their own probability distributions by which they churn out rewards, but this is unknown to the gambler. So the question is, how can the gambler learn how to get the maximum long term reward?

There are two things the gambler can do at any instance: either he can try a new slot machine, or he can play the slot machine he has tried before (and he knows he will get some reward). However, even though trying a new slot machine feels like it would bring less reward to the gambler, it is possible that the gambler finds out a new slot machine that gives a better reward than the current best slot machine. This is the dilemma of exploration vs exploitation. Trying out a new slot machine is exploration, while redoing the best move so far is exploiting the currently understood perception of the reward.

Source: blogs.mathworks.com

There are many strategies to approach this exploration-exploitation dilemma. A simple list of strategies can be found online here. A lot of these approaches are provably efficient, but assume that the state space is not very large. Strategies for large state spaces often just either ignore exploration, or do something naive like $\epsilon$-greedy, where you exploit with $1-\epsilon$ probability and explore "randomly" in rest of the cases.

This paper tries to use a Thompson sampling like approach to make decisions.

Thompson Sampling[1]

In Thompson sampling, our goal is to reach a belief that resembles the truth. Let's consider a case of coin tosses (2-armed bandit). Suppose we want to be able to reach a satisfactory pdf for $\mathbb{P}_h$ (heads). Assuming that this is a Bernoulli bandit problem, i.e. the rewards are $0$ or $1$, we can start off with $\mathbb{P}_h^{(0)}=\beta(1,1)$. The $\beta(x,y)$ distribution is a very good choice for a possible pdf because it works well for Bernoulli rewards. Further $\beta(1,1)$ is the uniform distribution $\mathbb{N}(0,1)$.

Now, at every iteration $t$, we observe the reward $R^{(t)}$ and try to make our belief close to the truth by doing a Bayesian computation. Assuming $p$ is the probability of getting a heads,

$$ \begin{align*} \mathbb{P}(R|D) &\propto \mathbb{P}(D|R) \cdot \mathbb{P}(R) \\ \mathbb{P}_h^{(t+1)}&\propto \mbox{likelihood}\cdot\mbox{prior} \\ &\propto p^{R^{(t)}}(1-p)^{R^{(t)}} \cdot \mathbb{P}_h^{(t)} \\ &\propto p^{R^{(t)}}(1-p)^{R^{(t)}} \cdot \beta(x_t, y_t) \\ &\propto p^{R^{(t)}}(1-p)^{R^{(t)}} \cdot p^{x_t-1}(1-p)^{y_t-1} \\ &\propto p^{x_t+R^{(t)}-1}(1-p)^{y_t+R^{(t)}-1} \\ &\propto \beta(x_t+R^{(t)}, y_t+R^{(t)}) \end{align*} $$

Source: Quora

This means that with successive sampling, our belief can become better at approximating the truth. There are similar update rules if we use a non Bernoulli setting, say, Gaussian. In the Gaussian case, we start with $\mathbb{P}_h^{(0)}=\mathbb{N}(0,1)$ and given that $\mathbb{P}_h^{(t)}\propto\mathbb{N}(\mu, \sigma)$ it is possible to show that the update rule looks like

$$ \mathbb{P}_h^{(t+1)} \propto \mathbb{N}\bigg(\frac{t\mu+R^{(t)}}{t+1},\frac{\sigma}{\sigma+1}\bigg) $$

How can we use this in reinforcement learning?

We can use this idea to decide when to explore and when to exploit. We start with an initial belief, choose an action, observe the reward and based on the kind of reward, we update our belief about what action to choose next.

References

1. Learning and optimization for sequential decision making, Columbia University, Lec 4