A Deeper Look into Importance Sampling - June 2, 2009

From last class, we determined that an integral of the form [math]I = \displaystyle\int h(x)f(x)\,dx [/math] can be rewritten as [math]I = \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx[/math]. We continue our discussion of importance sampling here.

Importance Sampling

We can see that the integral [math]\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx[/math] is just [math] \displaystyle E_g[\beta(x)h(x)][/math], the expectation of the weighted function [math]\displaystyle \beta(x)h(x)[/math] with respect to [math]\displaystyle g(x)[/math], where [math]\displaystyle \beta(x) = \frac{f(x)}{g(x)} [/math] is the weight. Where [math]\displaystyle f(x) > g(x)[/math], the weight [math]\displaystyle\beta(x)[/math] exceeds one, so those points count for more in the average. The points with more weight are deemed more important, hence "importance sampling". This can be seen as a variance reduction technique.
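
To make this concrete, here is a minimal sketch (an illustration added to these notes, with hypothetical function names) of the resulting Monte Carlo estimator: draw [math]\displaystyle x_1, \ldots, x_N[/math] from [math]\displaystyle g[/math] and average the weighted values [math]\displaystyle \beta(x_i)h(x_i)[/math]. It assumes [math]\displaystyle f, g, h[/math] can be evaluated pointwise and that we can sample from [math]\displaystyle g[/math].
<pre>
import numpy as np

def importance_sampling_estimate(h, f, g, sample_g, N=100000):
    """Estimate I = integral of h(x) f(x) dx by drawing x_1, ..., x_N from the
    proposal density g and averaging the weighted values beta(x_i) h(x_i)."""
    x = sample_g(N)               # draws from g
    beta = f(x) / g(x)            # importance weights beta(x) = f(x) / g(x)
    return np.mean(beta * h(x))   # Monte Carlo estimate of E_g[beta(x) h(x)]

# Illustration: E_f[X^2] = 1 with f = N(0,1), using the heavier-tailed proposal g = N(0, 2^2).
normal_pdf = lambda x, s: np.exp(-x**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
I_hat = importance_sampling_estimate(
    h=lambda x: x**2,
    f=lambda x: normal_pdf(x, 1.0),
    g=lambda x: normal_pdf(x, 2.0),
    sample_g=lambda n: np.random.normal(0.0, 2.0, n))
print(I_hat)   # should be close to the true value 1
</pre>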

Problem

The method of importance sampling is simple but can lead to problems: the estimator [math] \displaystyle \hat I [/math] produced by importance sampling can have infinite variance, and hence infinite standard error.

We have [math]\displaystyle I= \int w(x) g(x)\, dx = E_g[w(x)][/math], estimated by [math]\displaystyle \hat I = \frac{1}{N}\sum_{i=1}^{N} w(x_i) [/math] with [math]\displaystyle x_i \sim g[/math], where [math]\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} [/math].

Obtaining the second moment,

[math]\displaystyle E[(w(x))^2] [/math]
[math]\displaystyle = \int \left(\frac{h(x)f(x)}{g(x)}\right)^2 g(x)\, dx[/math]
[math]\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx [/math]
[math]\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx [/math]

We can see that if [math]\displaystyle g(x) \rightarrow 0 [/math] where [math]\displaystyle f(x)h(x) \neq 0 [/math], then [math]\displaystyle E[(w(x))^2] \rightarrow \infty [/math]. In particular, if [math]\displaystyle g [/math] has a thinner tail than [math]\displaystyle f [/math], then [math]\frac{h^2(x)f^2(x)}{g(x)} [/math] can become arbitrarily large in the tails. The general idea here is that [math]\frac{f(x)}{g(x)} [/math] should not be large.
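
As a hedged numerical illustration (a sketch added here, not from the original notes): take [math]\displaystyle f = N(0,1)[/math] and [math]\displaystyle h(x) = x^2[/math]. A proposal [math]\displaystyle g[/math] with thinner tails than [math]\displaystyle f[/math] (e.g. [math]\displaystyle N(0, 0.5^2)[/math]) produces occasional enormous weights and an unstable sample variance, because [math]\displaystyle E[(w(x))^2][/math] is infinite in that case, while a heavier-tailed proposal keeps the weights bounded.
<pre>
import numpy as np

np.random.seed(0)
normal_pdf = lambda x, s: np.exp(-x**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
f = lambda x: normal_pdf(x, 1.0)    # target density N(0,1)
h = lambda x: x**2

def weights(sigma_g, N=100000):
    """Importance weights w(x) = h(x) f(x) / g(x) for the proposal g = N(0, sigma_g^2)."""
    x = np.random.normal(0.0, sigma_g, N)
    return h(x) * f(x) / normal_pdf(x, sigma_g)

thin  = weights(0.5)   # g thinner-tailed than f: f/g explodes for large |x|
thick = weights(2.0)   # g heavier-tailed than f: f/g stays bounded
print("thin-tailed g : max weight %.2e, sample variance %.2e" % (thin.max(), thin.var()))
print("thick-tailed g: max weight %.2e, sample variance %.2e" % (thick.max(), thick.var()))
# The thin-tailed proposal shows a far larger maximum weight and sample variance,
# and since E[(w(x))^2] is infinite here, the sample variance never stabilizes.
</pre>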

Remark 1

It is evident that [math]\displaystyle g(x) [/math] should be chosen to have a thicker tail than [math]\displaystyle f(x) [/math]. If [math]\displaystyle f[/math] is large over a set [math]\displaystyle A[/math] but [math]\displaystyle g[/math] is small there, then [math]\displaystyle \frac{f}{g} [/math] is large on [math]\displaystyle A[/math], resulting in a large variance.

Remark 2

It is useful if we can choose [math]\displaystyle g [/math] to be similar to [math]\displaystyle f[/math] in terms of shape. Analytically, we can show that there is a [math]\displaystyle g[/math] that would result in a variance that is minimized.

Theorem: The choice of [math]\displaystyle g[/math] that minimizes variance of [math]\hat I[/math] is [math]\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}[/math].

Proof:
We know that [math]\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} [/math]

The variance of [math]\displaystyle w(x) [/math] under [math]\displaystyle g [/math] is

[math]\displaystyle Var[w(x)] [/math]
[math]\displaystyle = E[(w(x))^2] - [E[w(x)]]^2 [/math]
[math]\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x)\, dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)\,dx \right]^2 [/math]
[math]\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x)\, dx - \left[\int f(x)h(x)\, dx \right]^2 [/math]

As we can see, the second term does not depend on [math]\displaystyle g(x) [/math]. Therefore to minimize [math]\displaystyle Var[w(x)] [/math] we need to minimize the first term. We will use Jensen's Inequality.

Jensen's Inequality
If [math]\displaystyle \varphi [/math] is a convex function (for example, twice differentiable with [math]\displaystyle \varphi''(x) \geq 0 [/math]), then [math]\displaystyle \varphi(\alpha x_1 + (1-\alpha)x_2) \leq \alpha \varphi(x_1) + (1-\alpha) \varphi(x_2)[/math] for [math]\displaystyle 0 \leq \alpha \leq 1[/math]. (We write [math]\displaystyle \varphi[/math] for the convex function to avoid confusion with the proposal density [math]\displaystyle g[/math].)
Essentially, convexity means that the line segment between two points on the curve lies above the curve, and this extends to convex combinations of any number of points:

[math]\displaystyle \varphi(\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_n x_n) \leq \alpha_1 \varphi(x_1) + \alpha_2 \varphi(x_2) + \cdots + \alpha_n \varphi(x_n) [/math] where [math] \alpha_i \geq 0 [/math] and [math] \alpha_1 + \alpha_2 + \cdots + \alpha_n = 1 [/math]

Using Jensen's Inequality with the convex function [math]\displaystyle \varphi(t) = t^2[/math],

[math]\displaystyle \varphi(E[x]) \leq E[\varphi(x)] [/math], since [math]\displaystyle \varphi(E[x]) = \varphi(p_1 x_1 + \cdots + p_n x_n) \leq p_1 \varphi(x_1) + \cdots + p_n \varphi(x_n) = E[\varphi(x)] [/math] for a discrete distribution with probabilities [math]\displaystyle p_1, \ldots, p_n[/math]. Applying this to [math]\displaystyle \left| w(x) \right|[/math] gives
[math]\displaystyle E[(w(x))^2] \geq \left(E[\left| w(x) \right|]\right)^2 [/math]
and, writing out the right-hand side,
[math]\displaystyle \left(E[\left| w(x) \right|]\right)^2 = \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x)\, dx \right)^2 [/math]
[math]\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x)\, dx \right)^2 [/math], since [math]\displaystyle f [/math] and [math]\displaystyle g[/math] are density functions and cannot be negative,
[math]\displaystyle = \left(\int \left| h(x) \right| f(x)\, dx \right)^2 [/math]

Thus, [math]\displaystyle \left(\int \left| h(x) \right| f(x)\, dx \right)^2[/math] is a lower bound on [math]\displaystyle E[(w(x))^2][/math] that does not depend on [math]\displaystyle g[/math]. Substituting [math]\displaystyle g^*(x) [/math] for [math]\displaystyle g(x)[/math] shows that this bound is attained, so [math]\displaystyle g^* [/math] minimizes the variance.
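
For completeness, here is a brief sketch of that verification: substituting [math]\displaystyle g^*[/math] into the second moment,

[math]\displaystyle E_{g^*}[(w(x))^2] = \int \frac{f^2(x)h^2(x)}{g^*(x)}\, dx = \int f^2(x)h^2(x) \cdot \frac{\int \left| h(s) \right| f(s)\, ds}{\left| h(x) \right| f(x)}\, dx = \left( \int \left| h(s) \right| f(s)\, ds \right) \int \left| h(x) \right| f(x)\, dx = \left(\int \left| h(x) \right| f(x)\, dx \right)^2, [/math]

which equals the lower bound above, so [math]\displaystyle g^*[/math] attains the minimum.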

However, in practice it is difficult to compute [math]\displaystyle g^*[/math]: its normalizing constant [math]\displaystyle \int \left| h(s) \right| f(s)\, ds[/math] is essentially the same kind of integral we are trying to estimate in the first place.

Remark 3

Choose [math]\displaystyle g [/math] to be similar to [math]\displaystyle \left| h(x) \right| f(x) [/math] in terms of shape. That is, we want [math]\displaystyle g(x) \propto \left| h(x) \right| f(x) [/math].
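
As an illustrative example (an added sketch with an assumed setup, not from the original notes): to estimate [math]\displaystyle P(Z > 3)[/math] for [math]\displaystyle Z \sim N(0,1)[/math], we have [math]\displaystyle h(x) = \mathbf{1}\{x > 3\}[/math] and [math]\displaystyle f = N(0,1)[/math], so [math]\displaystyle \left| h(x) \right| f(x)[/math] is the standard normal density restricted to the right tail. A proposal concentrated on that tail, such as an exponential shifted to start at 3, matches this shape far better than sampling from [math]\displaystyle f[/math] directly.
<pre>
import numpy as np

np.random.seed(1)
N = 100000
phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density f

# Naive Monte Carlo: sample from f itself; almost no draws land beyond 3.
z = np.random.normal(size=N)
naive = np.mean(z > 3)

# Importance sampling: proposal g(x) = exp(-(x - 3)) for x > 3, shaped like |h(x)| f(x).
x = 3.0 + np.random.exponential(1.0, size=N)   # draws from g
w = phi(x) / np.exp(-(x - 3.0))                # weights h(x) f(x) / g(x); h(x) = 1 on x > 3
is_estimate = np.mean(w)

print("naive Monte Carlo:    %.6f" % naive)
print("importance sampling:  %.6f" % is_estimate)   # true value is about 0.00135
</pre>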

Continuing on - June 4, 2009