importance Sampling and Markov Chain Monte Carlo (MCMC)

From statwiki
Revision as of 08:45, 30 August 2017 by Conversion script (talk | contribs) (Conversion script moved page Importance Sampling and Markov Chain Monte Carlo (MCMC) to importance Sampling and Markov Chain Monte Carlo (MCMC): Converting page titles to lowercase)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Importance Sampling and Markov Chain Monte Carlo - June 4, 2009

Remark 4:

[math]\displaystyle{ I = \displaystyle\int^\ h(x)f(x)\,dx }[/math]

[math]\displaystyle{ = \displaystyle\int \ h(x)\frac{f(x)}{g(x)}g(x)\,dx }[/math]
[math]\displaystyle{ = \displaystyle\sum_{i=1}^{N} h(x_i)b(x_i) }[/math] where [math]\displaystyle{ \displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)} }[/math]
[math]\displaystyle{ =\displaystyle \frac{\int\ h(x)f(x)\,dx}{\int f(x) dx} }[/math]

Apply the idea of importance sampling to both numerator and denominator:

[math]\displaystyle{ =\displaystyle \frac{\int\ h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x) dx} }[/math]
[math]\displaystyle{ = \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{1=1}^{N} b(x_i)} }[/math]
[math]\displaystyle{ = \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i) }[/math] where [math]\displaystyle{ \displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{i=1}^{N} b(x_i)} }[/math]

The above results in the following form of Importance Sampling:

[math]\displaystyle{ \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) }[/math] where [math]\displaystyle{ \displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{i=1}^{N} b(x_i)} }[/math]

This is very important and useful especially when f is known only up to a proportionality constant. Often, this is the case in Bayesian approach when f is a posterior density function.

Example of Importance Sampling

Estimate [math]\displaystyle{ I = \displaystyle\ Pr (Z\gt 3) }[/math] when [math]\displaystyle{ Z \sim~ N(0,1) }[/math]

[math]\displaystyle{ I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 }[/math]
[math]\displaystyle{ = \displaystyle\int^\infty_3 h(x)f(x)\,dx }[/math]
Define [math]\displaystyle{ h(x) = \begin{cases} 0, & \text{if } x \lt 3 \\ 1, & \text{if } 3 \leq x \end{cases} }[/math]
Approach I: Monte Carlo
[math]\displaystyle{ \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i) }[/math] where [math]\displaystyle{ X \sim~ N(0,1) }[/math]

The idea here is to sample from normal distribution and to count number of observations that is greater than 3.

The variability will be high in this case if using Monte Carlo since this is considered a low probability event (a tail event), and different runs may give significantly different values. Foe example: the first run may give only 3 occurences (i.e if we generate 1000 samples, thus the probability will be .003), the second run may give 5 occurences (probability .005), etc.

This example can be illustrated in Matlab using the code below (we will be generating 100 samples in this case):

format long
for i = 1:100
   a(i) = sum(randn(100,1)>=3)/100;
end
meanMC  = mean(a)
varMC   = var(a)

On running this, we get [math]\displaystyle{ meanMC = 0.0005 }[/math] and [math]\displaystyle{ varMC \approx 1.31313 * 10^{-5} }[/math]

Approach II: Importance Sampling
[math]\displaystyle{ \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)} }[/math] where [math]\displaystyle{ f(x) }[/math] is standard normal and [math]\displaystyle{ g(x) }[/math] needs to be chosen wisely so that it is similar to the target distribution.
Let [math]\displaystyle{ g(x) \sim~ N(4,1) }[/math]
[math]\displaystyle{ b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)} }[/math]
[math]\displaystyle{ \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i) }[/math]

This example can be illustrated in Matlab using the code below:

for j = 1:100
   N = 100;
   x = randn (N,1) + 4;
     for ii = 1:N
         h = x(ii)>=3;
         b = exp(8-4*x(ii));
         w(ii) = h*b;
     end
  I(j) = sum(w)/N;
end
MEAN = mean(I)
VAR = var(I)

Running the above code gave us [math]\displaystyle{ MEAN \approx 0.001353 }[/math] and [math]\displaystyle{ VAR \approx 9.666 * 10^{-8} }[/math] which is very close to 0, and is much less than the variability observed when using Monte Carlo

Markov Chain Monte Carlo (MCMC)

Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain.

Consider the same integral: [math]\displaystyle{ I = \displaystyle\int^\ h(x)f(x)\,dx }[/math]

Idea: If [math]\displaystyle{ X_1, X_2,...X_N }[/math] is a Markov Chain with stationary distribution f(x), then under some conditions

[math]\displaystyle{ \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int^\ h(x)f(x)\,dx = I }[/math]


Stochastic Process:
Stochastic Process is a collection of random variables [math]\displaystyle{ \{ X_t : t \in T \} }[/math]

  • State Space Set: X is the set that random variables [math]\displaystyle{ X_t }[/math] takes values from.
  • Indexed Set: T is the set that t takes values from, which could be discrete or continuous in general, but we are only interested in discrete case in this course.


Example 1
i.i.d random variables

[math]\displaystyle{ \{ X_t : t \in T \}, X_t \in X }[/math]
[math]\displaystyle{ X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow }[/math]State Space
[math]\displaystyle{ T = \{1, 2, 3, 4, 5\} \rightarrow }[/math]Indexed Set


Example 2

[math]\displaystyle{ X_t }[/math]: price of a stock
[math]\displaystyle{ t }[/math]: opening date of the market

Basic Fact: In general, if we have random variables [math]\displaystyle{ X_1,...X_n }[/math]

[math]\displaystyle{ \displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_n-1,...,X_1) }[/math]
[math]\displaystyle{ \displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i) }[/math] where [math]\displaystyle{ \displaystyle Past_i = (X_i-1, X_i-2,...,X_1) }[/math]


Markov Chain:
A special form of stochastic process in which [math]\displaystyle{ X_t }[/math] depends only on [math]\displaystyle{ X_t-1 }[/math].

For example,

[math]\displaystyle{ \displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_n-1) }[/math]


Transition Probability:
The probability of going from one state to another state.

[math]\displaystyle{ p_{ij} = \Pr(X=X_j\mid X= X_i). \, }[/math]


Transition Matrix:
For n states, transition matrix P is an [math]\displaystyle{ N \times N }[/math] matrix with entries [math]\displaystyle{ P_{ij} }[/math] as below:

[math]\displaystyle{ P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\ p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\ \vdots&\vdots&\ddots&\vdots&\ddots\\ p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\ \vdots&\vdots&\ddots&\vdots&\ddots \end{matrix}\right) }[/math]


Transition Matrix Example:
There are n states in a random walk example. The probability of moving to the right state is Pr(heads) = p. And the probability of moving to the left state is Pr(tails) = q = 1-p. Stay at the end states once hit there. The transition matrix is shown as below:

[math]\displaystyle{ P=\left(\begin{matrix}1&0&\dots&0&\dots\\ p&0&q&0&\dots\\ 0&p&0&q&0\dots\\ \vdots&\vdots&\ddots&\vdots&\ddots\\ p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\ \vdots&\vdots&\ddots&\vdots&\ddots\\ 0&0&\dots&0&1 \end{matrix}\right) }[/math]

Markov Chain Definitions - June 9, 2009