a Deeper Look into Importance Sampling

From statwiki
Revision as of 16:13, 8 June 2009 by L3schaef (talk | contribs) (Cleaned up latex, added absolute value.)
Jump to navigation Jump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

A Deeper Look into Importance Sampling - June 2, 2009

From last class, we have determined that an integral can be written in the form [math]\displaystyle{ I = \displaystyle\int h(x)f(x)\,dx }[/math] [math]\displaystyle{ = \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx }[/math] We continue our discussion of Importance Sampling here.

Importance Sampling

We can see that the integral [math]\displaystyle{ \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx }[/math] is just [math]\displaystyle{ \displaystyle E_g(h(x)) \rightarrow }[/math]the expectation of h(x) with respect to g(x), where [math]\displaystyle{ \displaystyle \frac{f(x)}{g(x)} }[/math] is a weight [math]\displaystyle{ \displaystyle\beta(x) }[/math]. In the case where [math]\displaystyle{ \displaystyle f \gt g }[/math], a greater weight for [math]\displaystyle{ \displaystyle\beta(x) }[/math] will be assigned. Thus, the points with more weight are deemed more important, hence "importance sampling". This can be seen as a variance reduction technique.

Problem

The method of Importance Sampling is simple but can lead to some problems. The [math]\displaystyle{ \displaystyle \hat I }[/math] estimated by Importance Sampling could have infinite standard error.

Given [math]\displaystyle{ \displaystyle I= \int w(x) g(x) dx }[/math] [math]\displaystyle{ = \displaystyle E_g(w(x)) }[/math] [math]\displaystyle{ = \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) }[/math] where [math]\displaystyle{ \displaystyle w(x)=\frac{f(x)h(x)}{g(x)} }[/math].

Obtaining the second moment,

[math]\displaystyle{ \displaystyle E[(w(x))^2] }[/math]
[math]\displaystyle{ \displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx }[/math]
[math]\displaystyle{ \displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx }[/math]
[math]\displaystyle{ \displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx }[/math]

We can see that if [math]\displaystyle{ \displaystyle g(x) \rightarrow 0 }[/math], then [math]\displaystyle{ \displaystyle E[(w(x))^2] \rightarrow \infty }[/math]. This occurs if [math]\displaystyle{ \displaystyle g }[/math] has a thinner tail than [math]\displaystyle{ \displaystyle f }[/math] then [math]\displaystyle{ \frac{h^2(x)f^2(x)}{g(x)} }[/math] could be infinitely large. The general idea here is that [math]\displaystyle{ \frac{f(x)}{g(x)} }[/math] should not be large.

Remark 1

It is evident that [math]\displaystyle{ \displaystyle g(x) }[/math] should be chosen such that it has a thicker tail than [math]\displaystyle{ \displaystyle f(x) }[/math]. If [math]\displaystyle{ \displaystyle f }[/math] is large over set [math]\displaystyle{ \displaystyle A }[/math] but [math]\displaystyle{ \displaystyle g }[/math] is small, then [math]\displaystyle{ \displaystyle \frac{f}{g} }[/math] would be large and it would result in a large variance.

Remark 2

It is useful if we can choose [math]\displaystyle{ \displaystyle g }[/math] to be similar to [math]\displaystyle{ \displaystyle f }[/math] in terms of shape. Analytically, we can show that there is a [math]\displaystyle{ \displaystyle g }[/math] that would result in a variance that is minimized.

Theorem: The choice of [math]\displaystyle{ \displaystyle g }[/math] that minimizes variance of [math]\displaystyle{ \hat I }[/math] is [math]\displaystyle{ \displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds} }[/math].

Proof:
We know that [math]\displaystyle{ \displaystyle w(x)=\frac{f(x)h(x)}{g(x)} }[/math]

The variance of [math]\displaystyle{ \displaystyle w(x) }[/math] is

[math]\displaystyle{ \displaystyle Var[w(x)] }[/math]
[math]\displaystyle{ \displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 }[/math]
[math]\displaystyle{ \displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 }[/math]
[math]\displaystyle{ \displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) \right]^2 }[/math]

As we can see, the second term does not depend on [math]\displaystyle{ \displaystyle g(x) }[/math]. Therefore to minimize [math]\displaystyle{ \displaystyle Var[w(x)] }[/math] we need to minimize the first term. We will use Jensen's Inequality.

Jensen's Inequality
If [math]\displaystyle{ \displaystyle g }[/math] is a convex function ( twice differentiable and [math]\displaystyle{ \displaystyle g''(x) \geq 0 }[/math] ) then [math]\displaystyle{ \displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2) }[/math]
Essentially the definition of convexity implies that the line segment between two points on a curve lies above the curve, which can then be generalized to higher dimensions:

[math]\displaystyle{ \displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) }[/math] where [math]\displaystyle{ \alpha_1 + \alpha_2 + ... + \alpha_n = 1 }[/math]

Using Jensen's Inequality,

[math]\displaystyle{ \displaystyle g(E[x]) \leq E[g(x)] }[/math] as [math]\displaystyle{ \displaystyle g(E[x]) = g(p_1 x_1 + ... p_n x_n) \lt = p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] }[/math]
[math]\displaystyle{ \displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 }[/math]
[math]\displaystyle{ \displaystyle E[(w(x))^2] \geq \left( E[\int \left| \frac{f(x)h(x)}{g(x)} \right|] \right)^2 }[/math]
[math]\displaystyle{ \displaystyle \left(E[\int \left| \frac{f(x)h(x)}{g(x)} \right|] \right)^2 }[/math]
[math]\displaystyle{ \displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 }[/math]
[math]\displaystyle{ \displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 }[/math] since [math]\displaystyle{ \displaystyle f }[/math] and [math]\displaystyle{ \displaystyle g }[/math] are density functions, [math]\displaystyle{ \displaystyle f, g }[/math] cannot be negative.

Thus, this is a lower bound on [math]\displaystyle{ \displaystyle E[(w(x))^2] }[/math]. If we replace [math]\displaystyle{ \displaystyle g^*(x) }[/math] into [math]\displaystyle{ \displaystyle E[g^*(x)] }[/math], we can see that the result is as we require. Details omitted.

However, in practice, it is difficult to compute [math]\displaystyle{ \displaystyle g^* }[/math].

Remark 3

Choose [math]\displaystyle{ \displaystyle g }[/math] such that it is similar to [math]\displaystyle{ \displaystyle \left| h(x) \right| f(x) }[/math] in terms of shape. That is, we want [math]\displaystyle{ \displaystyle g \propto \displaystyle \left| h(x) \right| f(x) }[/math]

Continuing on - June 4, 2009