a Deeper Look into Importance Sampling: Difference between revisions

From statwiki
Jump to navigation Jump to search
m (Conversion script moved page A Deeper Look into Importance Sampling to a Deeper Look into Importance Sampling: Converting page titles to lowercase)
 
(38 intermediate revisions by 8 users not shown)
Line 20: Line 20:
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math>
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math>


We can see that if <math>\displaystyle g(x) \rightarrow 0 </math>, then <math>\displaystyle E[(w(x))^2] \rightarrow \infty </math>. This occurs if <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math> then <math>\frac{h^2(x)f^2(x)}{g(x)} </math> could be infinitely large.
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math>, then <math>\displaystyle E[(w(x))^2] \rightarrow \infty </math>. This occurs if <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math> then <math>\frac{h^2(x)f^2(x)}{g(x)} </math> could be infinitely large. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.


=====Remark 1=====
=====Remark 1=====
Line 27: Line 27:


=====Remark 2=====
=====Remark 2=====
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Analytically, we can show that there is a <math>\displaystyle g</math> that would result in a variance that is minimized.
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. Analytically, we can show that the best <math>\displaystyle g</math> is the one that would result in a variance that is minimized.


'''Theorem:''' The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int h(s) f(s) ds}</math>.
=====Remark 3=====
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math>


'''Proof:'''
 
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.
 
=====Proof:=====
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>


Line 37: Line 42:
:: <math>\displaystyle Var[w(x)] </math>
:: <math>\displaystyle Var[w(x)] </math>
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math>
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math>
:: <math>\displaystyle = \int (\frac{f(x)h(x)}{g(x)})^2 g(x) dx - [\int \frac{f(x)h(x)}{g(x)}g(x)dx]^2 </math>
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math>
:: <math>\displaystyle = \int (\frac{f(x)h(x)}{g(x)})^2 g(x) dx - [\int f(x)h(x)]^2 </math>
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) \right]^2 </math>
 
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.
 
======Aside: Jensen's Inequality======
If <math>\displaystyle g </math> is a convex function ( twice differentiable and <math>\displaystyle g''(x) \geq 0 </math> ) then <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math><br />
Essentially the definition of convexity implies that the line segment between two points on a curve lies above the curve, which can then be generalized to higher dimensions:
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math> \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math>
 
=====Proof (cont)=====
Using Jensen's Inequality, <br />
::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... p_n x_n) <= p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math>
Therefore
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math>
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br />
and
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math>
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math>
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative.  <br />
 
Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we replace <math>\displaystyle g^*(x) </math> into <math>\displaystyle E[g^*(x)]</math>, we can see that the result is as we require. Details omitted.<br />


As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we need to minimize the first term. But first we must go through Jenson's Inequality.
However, this is mostly of theoritical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>.


====Jenson's Inequality====
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.


===[[Continuing on]] - June 4, 2009===
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009===

Latest revision as of 08:45, 30 August 2017

A Deeper Look into Importance Sampling - June 2, 2009

From last class, we have determined that an integral can be written in the form [math]\displaystyle{ I = \displaystyle\int h(x)f(x)\,dx }[/math] [math]\displaystyle{ = \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx }[/math] We continue our discussion of Importance Sampling here.

Importance Sampling

We can see that the integral [math]\displaystyle{ \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx }[/math] is just [math]\displaystyle{ \displaystyle E_g(h(x)) \rightarrow }[/math]the expectation of h(x) with respect to g(x), where [math]\displaystyle{ \displaystyle \frac{f(x)}{g(x)} }[/math] is a weight [math]\displaystyle{ \displaystyle\beta(x) }[/math]. In the case where [math]\displaystyle{ \displaystyle f \gt g }[/math], a greater weight for [math]\displaystyle{ \displaystyle\beta(x) }[/math] will be assigned. Thus, the points with more weight are deemed more important, hence "importance sampling". This can be seen as a variance reduction technique.

Problem

The method of Importance Sampling is simple but can lead to some problems. The [math]\displaystyle{ \displaystyle \hat I }[/math] estimated by Importance Sampling could have infinite standard error.

Given [math]\displaystyle{ \displaystyle I= \int w(x) g(x) dx }[/math] [math]\displaystyle{ = \displaystyle E_g(w(x)) }[/math] [math]\displaystyle{ = \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) }[/math] where [math]\displaystyle{ \displaystyle w(x)=\frac{f(x)h(x)}{g(x)} }[/math].

Obtaining the second moment,

[math]\displaystyle{ \displaystyle E[(w(x))^2] }[/math]
[math]\displaystyle{ \displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx }[/math]
[math]\displaystyle{ \displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx }[/math]
[math]\displaystyle{ \displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx }[/math]

We can see that if [math]\displaystyle{ \displaystyle g(x) \rightarrow 0 }[/math], then [math]\displaystyle{ \displaystyle E[(w(x))^2] \rightarrow \infty }[/math]. This occurs if [math]\displaystyle{ \displaystyle g }[/math] has a thinner tail than [math]\displaystyle{ \displaystyle f }[/math] then [math]\displaystyle{ \frac{h^2(x)f^2(x)}{g(x)} }[/math] could be infinitely large. The general idea here is that [math]\displaystyle{ \frac{f(x)}{g(x)} }[/math] should not be large.

Remark 1

It is evident that [math]\displaystyle{ \displaystyle g(x) }[/math] should be chosen such that it has a thicker tail than [math]\displaystyle{ \displaystyle f(x) }[/math]. If [math]\displaystyle{ \displaystyle f }[/math] is large over set [math]\displaystyle{ \displaystyle A }[/math] but [math]\displaystyle{ \displaystyle g }[/math] is small, then [math]\displaystyle{ \displaystyle \frac{f}{g} }[/math] would be large and it would result in a large variance.

Remark 2

It is useful if we can choose [math]\displaystyle{ \displaystyle g }[/math] to be similar to [math]\displaystyle{ \displaystyle f }[/math] in terms of shape. Ideally, the optimal [math]\displaystyle{ \displaystyle g }[/math] should be similar to [math]\displaystyle{ \displaystyle \left| h(x) \right|f(x) }[/math], and have a thicker tail. Analytically, we can show that the best [math]\displaystyle{ \displaystyle g }[/math] is the one that would result in a variance that is minimized.

Remark 3

Choose [math]\displaystyle{ \displaystyle g }[/math] such that it is similar to [math]\displaystyle{ \displaystyle \left| h(x) \right| f(x) }[/math] in terms of shape. That is, we want [math]\displaystyle{ \displaystyle g \propto \displaystyle \left| h(x) \right| f(x) }[/math]


Theorem (Minimum Variance Choice of [math]\displaystyle{ \displaystyle g }[/math])

The choice of [math]\displaystyle{ \displaystyle g }[/math] that minimizes variance of [math]\displaystyle{ \hat I }[/math] is [math]\displaystyle{ \displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds} }[/math].

Proof:

We know that [math]\displaystyle{ \displaystyle w(x)=\frac{f(x)h(x)}{g(x)} }[/math]

The variance of [math]\displaystyle{ \displaystyle w(x) }[/math] is

[math]\displaystyle{ \displaystyle Var[w(x)] }[/math]
[math]\displaystyle{ \displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 }[/math]
[math]\displaystyle{ \displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 }[/math]
[math]\displaystyle{ \displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) \right]^2 }[/math]

As we can see, the second term does not depend on [math]\displaystyle{ \displaystyle g(x) }[/math]. Therefore to minimize [math]\displaystyle{ \displaystyle Var[w(x)] }[/math] we only need to minimize the first term. In doing so we will use Jensen's Inequality.

Aside: Jensen's Inequality

If [math]\displaystyle{ \displaystyle g }[/math] is a convex function ( twice differentiable and [math]\displaystyle{ \displaystyle g''(x) \geq 0 }[/math] ) then [math]\displaystyle{ \displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2) }[/math]
Essentially the definition of convexity implies that the line segment between two points on a curve lies above the curve, which can then be generalized to higher dimensions:

[math]\displaystyle{ \displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) }[/math] where [math]\displaystyle{ \alpha_1 + \alpha_2 + ... + \alpha_n = 1 }[/math]
Proof (cont)

Using Jensen's Inequality,

[math]\displaystyle{ \displaystyle g(E[x]) \leq E[g(x)] }[/math] as [math]\displaystyle{ \displaystyle g(E[x]) = g(p_1 x_1 + ... p_n x_n) \lt = p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] }[/math]

Therefore

[math]\displaystyle{ \displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 }[/math]
[math]\displaystyle{ \displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 }[/math]

and

[math]\displaystyle{ \displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 }[/math]
[math]\displaystyle{ \displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 }[/math]
[math]\displaystyle{ \displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 }[/math] since [math]\displaystyle{ \displaystyle f }[/math] and [math]\displaystyle{ \displaystyle g }[/math] are density functions, [math]\displaystyle{ \displaystyle f, g }[/math] cannot be negative.

Thus, this is a lower bound on [math]\displaystyle{ \displaystyle E[(w(x))^2] }[/math]. If we replace [math]\displaystyle{ \displaystyle g^*(x) }[/math] into [math]\displaystyle{ \displaystyle E[g^*(x)] }[/math], we can see that the result is as we require. Details omitted.

However, this is mostly of theoritical interest. In practice, it is impossible or very difficult to compute [math]\displaystyle{ \displaystyle g^* }[/math].

Note: Jensen's inequality is actually unnecessary here. We just use it to get [math]\displaystyle{ E[(w(x))^2] \geq (E[|w(x)|])^2 }[/math], which could be derived using variance properties: [math]\displaystyle{ 0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2 }[/math].

Importance Sampling and Markov Chain Monte Carlo (MCMC) - June 4, 2009