importance Sampling and Markov Chain Monte Carlo (MCMC)
Importance Sampling and Markov Chain Monte Carlo - June 4, 2009
Remark 4:
[math]\displaystyle{ I = \displaystyle\int^\ h(x)f(x)\,dx }[/math]
- [math]\displaystyle{ = \displaystyle\int \ h(x)\frac{f(x)}{g(x)}g(x)\,dx }[/math]
- [math]\displaystyle{ = \displaystyle\sum_{i=1}^{N} h(x_i)b(x_i) }[/math] where [math]\displaystyle{ \displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)} }[/math]
- [math]\displaystyle{ =\displaystyle \frac{\int\ h(x)f(x)\,dx}{\int f(x) dx} }[/math]
Apply the idea of importance sampling to both numerator and denominator:
- [math]\displaystyle{ =\displaystyle \frac{\int\ h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x) dx} }[/math]
- [math]\displaystyle{ = \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{1=1}^{N} b(x_i)} }[/math]
- [math]\displaystyle{ = \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i) }[/math] where [math]\displaystyle{ \displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{i=1}^{N} b(x_i)} }[/math]
The above results in the following form of Importance Sampling:
- [math]\displaystyle{ \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) }[/math] where [math]\displaystyle{ \displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{i=1}^{N} b(x_i)} }[/math]
This is very important and useful especially when f is known only up to a proportionality constant. Often, this is the case in Bayesian approach when f is a posterior density function.
Example of Importance Sampling
Estimate [math]\displaystyle{ I = \displaystyle\ Pr (Z\gt 3) }[/math] when [math]\displaystyle{ Z \sim~ N(0,1) }[/math]
- [math]\displaystyle{ I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 }[/math]
- [math]\displaystyle{ = \displaystyle\int^\infty_3 h(x)f(x)\,dx }[/math]
- Define [math]\displaystyle{ h(x) = \begin{cases} 0, & \text{if } x \lt 3 \\ 1, & \text{if } 3 \leq x \end{cases} }[/math]
Approach I: Monte Carlo
- [math]\displaystyle{ \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i) }[/math] where [math]\displaystyle{ X \sim~ N(0,1) }[/math]
The idea here is to sample from normal distribution and to count number of observations that is greater than 3.
The variability will be high in this case if using Monte Carlo since this is considered a low probability event (a tail event), and different runs may give significantly different values. Foe example: the first run may give only 3 occurences (i.e if we generate 1000 samples, thus the probability will be .003), the second run may give 5 occurences (probability .005), etc.
This example can be illustrated in Matlab using the code below (we will be generating 100 samples in this case):
format long for i = 1:100 a(i) = sum(randn(100,1)>=3)/100; end meanMC = mean(a) varMC = var(a)
On running this, we get [math]\displaystyle{ meanMC = 0.0005 }[/math] and [math]\displaystyle{ varMC \approx 1.31313 * 10^{-5} }[/math]
Approach II: Importance Sampling
- [math]\displaystyle{ \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)} }[/math] where [math]\displaystyle{ f(x) }[/math] is standard normal and [math]\displaystyle{ g(x) }[/math] needs to be chosen wisely so that it is similar to the target distribution.
- Let [math]\displaystyle{ g(x) \sim~ N(4,1) }[/math]
- [math]\displaystyle{ b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)} }[/math]
- [math]\displaystyle{ \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i) }[/math]
This example can be illustrated in Matlab using the code below:
for j = 1:100 N = 100; x = randn (N,1) + 4; for ii = 1:N h = x(ii)>=3; b = exp(8-4*x(ii)); w(ii) = h*b; end I(j) = sum(w)/N; end MEAN = mean(I) VAR = var(I)
Running the above code gave us [math]\displaystyle{ MEAN \approx 0.001353 }[/math] and [math]\displaystyle{ VAR \approx 9.666 * 10^{-8} }[/math] which is very close to 0, and is much less than the variability observed when using Monte Carlo
Markov Chain Monte Carlo (MCMC)
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain.
Consider the same integral: [math]\displaystyle{ I = \displaystyle\int^\ h(x)f(x)\,dx }[/math]
Idea: If [math]\displaystyle{ X_1, X_2,...X_N }[/math] is a Markov Chain with stationary distribution f(x), then under some conditions
- [math]\displaystyle{ \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int^\ h(x)f(x)\,dx = I }[/math]
Stochastic Process:
Stochastic Process is a collection of random variables [math]\displaystyle{ \{ X_t : t \in T \} }[/math]
- State Space Set: X is the set that random variables [math]\displaystyle{ X_t }[/math] takes values from.
- Indexed Set: T is the set that t takes values from, which could be discrete or continuous in general, but we are only interested in discrete case in this course.
Example 1
i.i.d random variables
- [math]\displaystyle{ \{ X_t : t \in T \}, X_t \in X }[/math]
- [math]\displaystyle{ X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow }[/math]State Space
- [math]\displaystyle{ T = \{1, 2, 3, 4, 5\} \rightarrow }[/math]Indexed Set
Example 2
- [math]\displaystyle{ X_t }[/math]: price of a stock
- [math]\displaystyle{ t }[/math]: opening date of the market
Basic Fact: In general, if we have random variables [math]\displaystyle{ X_1,...X_n }[/math]
- [math]\displaystyle{ \displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_n-1,...,X_1) }[/math]
- [math]\displaystyle{ \displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i) }[/math] where [math]\displaystyle{ \displaystyle Past_i = (X_i-1, X_i-2,...,X_1) }[/math]
Markov Chain:
A special form of stochastic process in which [math]\displaystyle{ X_t }[/math] depends only on [math]\displaystyle{ X_t-1 }[/math].
For example,
- [math]\displaystyle{ \displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_n-1) }[/math]
Transition Probability:
The probability of going from one state to another state.
- [math]\displaystyle{ p_{ij} = \Pr(X=X_j\mid X= X_i). \, }[/math]
Transition Matrix:
For n states, transition matrix P is an [math]\displaystyle{ N \times N }[/math] matrix with entries [math]\displaystyle{ P_{ij} }[/math] as below:
- [math]\displaystyle{ P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\ p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\ \vdots&\vdots&\ddots&\vdots&\ddots\\ p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\ \vdots&\vdots&\ddots&\vdots&\ddots \end{matrix}\right) }[/math]
Transition Matrix Example:
There are n states in a random walk example. The probability of moving to the right state is Pr(heads) = p. And the probability of moving to the left state is Pr(tails) = q = 1-p. Stay at the end states once hit there. The transition matrix is shown as below:
- [math]\displaystyle{ P=\left(\begin{matrix}1&0&\dots&0&\dots\\ p&0&q&0&\dots\\ 0&p&0&q&0\dots\\ \vdots&\vdots&\ddots&\vdots&\ddots\\ p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\ \vdots&\vdots&\ddots&\vdots&\ddots\\ 0&0&\dots&0&1 \end{matrix}\right) }[/math]